Abstract

Despite the recent progress towards efficient multiple kernel learning (MKL), the structured 
output case remains an open research front. Current approaches involve repeatedly solving a batch 
learning problem, which makes them inadequate for large scale scenarios. We propose a new 
family of online proximal algorithms for MKL (as well as for group-LASSO and variants thereof), 
which overcomes that drawback. We show regret, convergence, and generalization bounds for the 
proposed method. Experiments on handwriting recognition and dependency parsing demonstrate the
effectiveness of the approach.

1 Introduction 

Structured prediction (Lafferty et al., 2001; Taskar et al., 2003; Tsochantaridis et al., 2004) deals with 
problems with a strong interdependence among the output variables, often with sequential, graphical, 
or combinatorial structure. Despite recent advances toward a unified formalism, obtaining a good 
predictor often requires a significant effort in designing kernels (i.e., features and similarity measures)
and tuning hyperparameters. The slowness in training structured predictors in large scale settings 
makes this an expensive process. 

The need for careful kernel engineering can be sidestepped using the kernel learning approach 
initiated in Bach et al. (2004); Lanckriet et al. (2004), where a combination of multiple kernels is 
learned from the data. While multi-class and scalable multiple kernel learning (MKL) algorithms have 
been proposed (Sonnenburg et al., 2006; Zien and Ong, 2007; Rakotomamonjy et al., 2008; Chapelle 
and Rakotomamonjy, 2008; Xu et al., 2009; Suzuki and Tomioka, 2009), none are well suited for 
large-scale structured prediction, for the following reason: all involve an inner loop in which a standard 
learning problem (e.g., an SVM) is repeatedly solved; in large-scale structured prediction, it is often
prohibitive to tackle this problem in its batch form, and one typically resorts to online methods (Bottou,
1991; Collins, 2002; Ratliff et al., 2006; Collins et al., 2008). These methods are fast in achieving low
generalization error, but converge slowly to the training objective, and are thus unattractive for repeated
use in the inner loop.

In this paper, we overcome the above difficulty by proposing a stand-alone online MKL algorithm. 
The algorithm is based on the kernelization of the recent forward-backward splitting scheme FOBOS
(Duchi and Singer, 2009) and iterates between subgradient and proximal steps. In passing, we improve
the FOBOS regret bound and show how to efficiently compute the proximal projections associated with
the squared $\ell_1$-norm, despite the fact that the underlying optimization problem is not separable.

After reviewing structured prediction and MKL (§2), we present a wide class of online proximal 
algorithms (§3) which extend FOBOS by handling composite regularizers with multiple proximal steps. 
These algorithms have convergence guarantees and are applicable in MKL, group-LASSO (Yuan and
Lin, 2006) and other structural sparsity formalisms, such as hierarchical LASSO/MKL (Bach, 2008b;
Zhao et al., 2008), group-LASSO with overlapping groups (Jenatton et al., 2009), sparse group-LASSO
(Friedman et al., 2010), and the elastic net MKL (Tomioka and Suzuki, 2010). We apply our MKL
algorithm to structured prediction (§4), using the following two testbeds: sequence labeling for
handwritten text recognition, and natural language dependency parsing. We show the potential of our
approach by learning combinations of kernels from tens of thousands of training instances, with
encouraging results in terms of runtimes, accuracy and identifiability.

2 Structured Prediction, Group Sparsity, and Multiple Kernel Learning

Let $\mathcal{X}$ and $\mathcal{Y}$ be the input and output sets, respectively. In structured prediction, to each input $x \in \mathcal{X}$
corresponds a (structured and exponentially large) set $\mathcal{Y}(x) \subseteq \mathcal{Y}$ of legal outputs; e.g., in sequence
labeling, each $x \in \mathcal{X}$ is an observed sequence and each $y \in \mathcal{Y}(x)$ is the corresponding sequence of
labels; in parsing, each $x \in \mathcal{X}$ is a string, and each $y \in \mathcal{Y}(x)$ is a parse tree that spans that string.

Let $\mathcal{U} = \{(x, y) \mid x \in \mathcal{X}, y \in \mathcal{Y}(x)\}$ be the set of all legal input-output pairs. Given a labeled
dataset $\mathcal{D} = \{(x_1, y_1), \ldots, (x_m, y_m)\} \subseteq \mathcal{U}$, we want to learn a predictor $h : \mathcal{X} \to \mathcal{Y}$ of the form

$$h(x) = \arg\max_{y \in \mathcal{Y}(x)} f(x, y), \tag{1}$$

where $f : \mathcal{U} \to \mathbb{R}$ is a compatibility function. Problem (1) is called inference (or decoding) and involves
combinatorial optimization (e.g., dynamic programming). In this paper, we use linear functions,
$f(x, y) = \langle \theta, \phi(x, y) \rangle$, where $\theta$ is a parameter vector and $\phi(x, y)$ a feature vector. The structure of the
output is usually taken care of by assuming a decomposition of the form $\phi(x, y) = \sum_{r \in \mathcal{R}} \phi_r(x, y_r)$,
where $\mathcal{R}$ is a set of parts and the $y_r$ are partial output assignments (see (Taskar et al., 2003) for
details). Instead of explicit features, one may use a positive definite kernel, $K : \mathcal{U} \times \mathcal{U} \to \mathbb{R}$, and let $f$
belong to the induced RKHS $\mathcal{H}_K$. Given a convex loss function $L : \mathcal{H}_K \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, the learning
problem is usually formulated as a minimization of the regularized empirical risk:

$$\min_{f \in \mathcal{H}_K} \frac{\lambda}{2} \|f\|_{\mathcal{H}_K}^2 + \frac{1}{m} \sum_{i=1}^m L(f; x_i, y_i), \tag{2}$$

where $\lambda \ge 0$ is a regularization parameter and $\|\cdot\|_{\mathcal{H}_K}$ is the norm in $\mathcal{H}_K$. In structured prediction, the
logistic loss (in CRFs) and the structured hinge loss (in SVMs) are common choices:

$$L_{\mathrm{CRF}}(f; x, y) = \log \sum_{y' \in \mathcal{Y}(x)} \exp(f(x, y') - f(x, y)), \tag{3}$$

$$L_{\mathrm{SVM}}(f; x, y) = \max_{y' \in \mathcal{Y}(x)} f(x, y') - f(x, y) + \ell(y', y). \tag{4}$$

In (4), $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a user-given cost function. The solution of (2) can be expressed as a kernel
expansion (structured version of the representer theorem (Hofmann et al., 2008, Corollary 13)).
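To make the two losses concrete, the following minimal Python sketch evaluates (3) and (4) for an input whose legal outputs can be listed explicitly. It is only illustrative: in real structured problems the sum and the max are computed by marginal and loss-augmented inference rather than by enumeration, and the names `crf_loss`, `svm_loss`, and `zero_one` (a 0/1 cost) are assumptions made for this example.

```python
import math

def crf_loss(scores, y):
    """Logistic loss (3): log-sum-exp of all scores minus the score of y.

    `scores` maps each legal output y' to f(x, y')."""
    top = max(scores.values())  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return log_z - scores[y]

def svm_loss(scores, y, cost):
    """Structured hinge loss (4): cost-augmented max minus the score of y."""
    return max(s + cost(y_prime, y) for y_prime, s in scores.items()) - scores[y]

def zero_one(y_prime, y):
    """Hypothetical cost function: 0 if the outputs agree, 1 otherwise."""
    return 0.0 if y_prime == y else 1.0

scores = {"a": 2.0, "b": 0.0}
hinge_correct = svm_loss(scores, "a", zero_one)  # margin of 2 exceeds the cost
hinge_wrong = svm_loss(scores, "b", zero_one)    # violated margin is penalized
logistic = crf_loss(scores, "a")
```

Both losses are nonnegative and vanish (hinge) or become small (logistic) only when the correct output outscores the alternatives by a sufficient margin.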

In the kernel learning framework (Bach et al., 2004; Lanckriet et al., 2004), the kernel is expressed
as a convex combination of elements of a finite set $\{K_1, \ldots, K_p\}$, the coefficients of which are learned
from data. That is, $K \in \mathcal{K}$, where

$$\mathcal{K} \triangleq \Big\{ K = \sum_{j=1}^p \beta_j K_j \;\Big|\; \beta \in \Delta^p \Big\}, \quad \text{with} \quad \Delta^p \triangleq \Big\{ \beta \in \mathbb{R}_+^p \;\Big|\; \sum_{j=1}^p \beta_j = 1 \Big\}. \tag{5}$$

The so-called MKL problem is the minimization of (2) with respect to $K$. Letting $\mathcal{H}_{\mathcal{K}} = \bigoplus_{j=1}^p \mathcal{H}_{K_j}$
be the direct sum of the RKHS, this optimization can be written (as shown in (Bach et al., 2004;
Rakotomamonjy et al., 2008)) as:

$$f^* = \arg\min_{f \in \mathcal{H}_{\mathcal{K}}} \frac{\lambda}{2} \Big( \sum_{j=1}^p \|f_j\|_{\mathcal{H}_{K_j}} \Big)^2 + \frac{1}{m} \sum_{i=1}^m L\Big( \sum_{j=1}^p f_j; x_i, y_i \Big), \tag{6}$$

where the optimal kernel coefficients are $\beta_j^* = \|f_j^*\|_{\mathcal{H}_{K_j}} / \sum_{l=1}^p \|f_l^*\|_{\mathcal{H}_{K_l}}$. For explicit features, the
parameter vector is split into $p$ groups, $\theta = (\theta_1, \ldots, \theta_p)$, and the minimization in (6) becomes

$$\theta^* = \arg\min_{\theta \in \mathbb{R}^d} \frac{\lambda}{2} \|\theta\|_{2,1}^2 + \frac{1}{m} \sum_{i=1}^m L(\theta; x_i, y_i), \tag{7}$$

where $\|\theta\|_{2,1} = \sum_{j=1}^p \|\theta_j\|$ is a sum of $\ell_2$-norms, called the mixed $\ell_{2,1}$-norm. The group-LASSO
criterion (Yuan and Lin, 2006) is similar to (7), without the square in the regularization term, revealing
a close relationship with MKL (Bach, 2008a). In fact, the two problems are equivalent up to a change
of $\lambda$. The $\ell_{2,1}$-norm regularizer favors group sparsity: groups that are found irrelevant tend to be
entirely discarded.

Early approaches to MKL (Lanckriet et al., 2004; Bach et al., 2004) considered the dual of (6) 
in a QCQP or SOCP form, thus were limited to small scale problems. Subsequent work focused on 
scalability: in (Sonnenburg et al., 2006), a semi-infinite LP formulation and a cutting plane algorithm 
are proposed; SimpleMKL (Rakotomamonjy et al., 2008) alternates between learning an SVM and a 
gradient-based (or Newton (Chapelle and Rakotomamonjy, 2008)) update of the kernel weights; other
techniques include the extended level method (Xu et al., 2009) and SpicyMKL (Suzuki and Tomioka, 
2009), based on an augmented Lagrangian method. These are all batch algorithms, requiring the 
repeated solution of problems of the form (2); even if one can take advantage of warm-starts, the 
convergence proofs of these methods, when available, rely on the exactness (or prescribed accuracy in 
the dual) of these solutions. 

In contrast, we tackle (6) and (7) in primal form. Rather than repeatedly calling off-the-shelf 
solvers for (2), we propose a stand-alone online algorithm with runtime comparable to that of solving 
a single instance of (2) by online methods (the fastest in large-scale settings (Shalev-Shwartz et al., 
2007; Bottou, 1991)). This paradigm shift paves the way for extending MKL to structured prediction, 
a large territory yet to be explored. 
3 Online Proximal Algorithms



We frame our online MKL algorithm in a wider class of online proximal algorithms. The theory of 
proximity operators (Moreau, 1962), which is widely known in optimization and has recently gained 
prominence in the signal processing community (Combettes and Wajs, 2006; Wright et al., 2009), 
provides tools for analyzing these algorithms and generalizes many known results, sometimes with 
remarkable simplicity. We thus start by summarizing its important concepts in §3.1, together with a 
quick review of convex analysis. 

3.1 Convex Functions, Subdifferentials, Proximity Operators, and Moreau Projections

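Throughout, we let $\varphi : \mathbb{R}^p \to \bar{\mathbb{R}}$ (where $\bar{\mathbb{R}} = \mathbb{R} \cup \{+\infty\}$) be a convex, lower semicontinuous (lsc)
(the epigraph $\mathrm{epi}\,\varphi = \{(x, t) \in \mathbb{R}^p \times \mathbb{R} \mid \varphi(x) \le t\}$ is closed in $\mathbb{R}^p \times \mathbb{R}$), and proper ($\exists x : \varphi(x) \ne +\infty$)
function. The subdifferential of $\varphi$ at $x_0$ is the set

$$\partial\varphi(x_0) = \{ g \in \mathbb{R}^p \mid \forall x \in \mathbb{R}^p,\ \varphi(x) - \varphi(x_0) \ge g^\top (x - x_0) \},$$

the elements of which are the subgradients. We say that $\varphi$ is $G$-Lipschitz in $S \subseteq \mathbb{R}^p$ if $\forall x \in S$, $\forall g \in
\partial\varphi(x)$, $\|g\| \le G$. We say that $\varphi$ is $\sigma$-strongly convex in $S$ if

$$\forall x_0 \in S,\ \forall g \in \partial\varphi(x_0),\ \forall x \in \mathbb{R}^p, \quad \varphi(x) \ge \varphi(x_0) + g^\top (x - x_0) + (\sigma/2) \|x - x_0\|^2.$$

The Fenchel conjugate of $\varphi$ is $\varphi^* : \mathbb{R}^p \to \bar{\mathbb{R}}$, $\varphi^*(y) = \sup_x y^\top x - \varphi(x)$. Let:

$$M_\varphi(y) = \inf_x \tfrac{1}{2} \|x - y\|^2 + \varphi(x), \quad \text{and} \quad \mathrm{prox}_\varphi(y) = \arg\inf_x \tfrac{1}{2} \|x - y\|^2 + \varphi(x);$$

the function $M_\varphi : \mathbb{R}^p \to \bar{\mathbb{R}}$ is called the Moreau envelope of $\varphi$, and the map $\mathrm{prox}_\varphi : \mathbb{R}^p \to \mathbb{R}^p$ is the
proximity operator of $\varphi$ (Combettes and Wajs, 2006; Moreau, 1962). Proximity operators generalize
Euclidean projectors: consider the case $\varphi = \iota_C$, where $C \subseteq \mathbb{R}^p$ is a convex set and $\iota_C$ denotes its
indicator (i.e., $\varphi(x) = 0$ if $x \in C$ and $+\infty$ otherwise). Then, $\mathrm{prox}_{\iota_C}$ is the Euclidean projector onto $C$
and $M_{\iota_C}$ is the residual. Two other important examples of proximity operators follow:

• if $\varphi(x) = (\lambda/2) \|x\|^2$, then $\mathrm{prox}_\varphi(y) = y/(1 + \lambda)$;

• if $\varphi(x) = \tau \|x\|_1$, then $\mathrm{prox}_\varphi(y) = \mathrm{soft}(y, \tau)$ is the soft-threshold function (Wright et al.,
2009), defined as $[\mathrm{soft}(y, \tau)]_k = \mathrm{sgn}(y_k) \cdot \max\{0, |y_k| - \tau\}$.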
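The two closed-form proximity operators above can be sketched in a few lines of Python. This is a minimal illustration; the function names `soft`, `prox_sq_l2`, and `prox_objective` are chosen for this example, and `prox_objective` is included only so that one can check numerically that the closed forms do minimize the defining objective.

```python
import math

def soft(y, tau):
    """Soft-threshold: prox of tau * ||x||_1, applied entrywise."""
    return [math.copysign(max(0.0, abs(v) - tau), v) for v in y]

def prox_sq_l2(y, lam):
    """Prox of (lam/2) * ||x||^2: a uniform shrinkage toward the origin."""
    return [v / (1.0 + lam) for v in y]

def prox_objective(x, y, phi):
    """The quantity 0.5 * ||x - y||^2 + phi(x) that prox_phi(y) minimizes over x."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, y)) + phi(x)

y = [3.0, -0.5, 1.2]
z = soft(y, 1.0)            # large entries shrink by 1, the small one is zeroed
w = prox_sq_l2([2.0, -4.0], 1.0)
```

Note the qualitative difference: the squared norm only rescales, whereas the $\ell_1$ norm sets small entries exactly to zero, which is what makes proximal steps attractive for sparsity.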

If $\varphi : \mathbb{R}^{d_1} \times \ldots \times \mathbb{R}^{d_p} \to \bar{\mathbb{R}}$ is (group-)separable, i.e., $\varphi(x) = \sum_{k=1}^p \varphi_k(x_k)$, where $x_k \in \mathbb{R}^{d_k}$,
then its proximity operator inherits the same (group-)separability: $[\mathrm{prox}_\varphi(x)]_k = \mathrm{prox}_{\varphi_k}(x_k)$ (Wright
et al., 2009). For example, the proximity operator of the mixed $\ell_{2,1}$-norm, which is group-separable,
has this form. The following proposition, which we prove in Appendix A, extends this result by showing
how to compute proximity operators of functions (maybe not separable) that only depend on the
$\ell_2$-norms of groups of components; e.g., the proximity operator of the squared $\ell_{2,1}$-norm reduces to that
of squared $\ell_1$.

Proposition 1 Let $\varphi : \mathbb{R}^{d_1} \times \ldots \times \mathbb{R}^{d_p} \to \bar{\mathbb{R}}$ be of the form $\varphi(x_1, \ldots, x_p) = \psi(\|x_1\|, \ldots, \|x_p\|)$
for some $\psi : \mathbb{R}^p \to \bar{\mathbb{R}}$. Then, $M_\varphi(x_1, \ldots, x_p) = M_\psi(\|x_1\|, \ldots, \|x_p\|)$ and $[\mathrm{prox}_\varphi(x_1, \ldots, x_p)]_k =
[\mathrm{prox}_\psi(\|x_1\|, \ldots, \|x_p\|)]_k (x_k / \|x_k\|)$.
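As an illustration of Prop. 1, the sketch below computes the proximity operator of $\tau$ times the mixed $\ell_{2,1}$-norm by soft-thresholding the vector of group norms and rescaling each group accordingly. The function name `prox_group_l21` and the list-of-lists representation of groups are assumptions made for this example.

```python
import math

def prox_group_l21(groups, tau):
    """Prox of tau * ||x||_{2,1}, computed through the group norms (Prop. 1).

    `groups` is a list of groups, each a list of floats."""
    result = []
    for g in groups:
        norm = math.sqrt(sum(v * v for v in g))
        shrunk = max(0.0, norm - tau)  # soft-threshold of the (nonnegative) norm
        scale = shrunk / norm if norm > 0 else 0.0
        result.append([scale * v for v in g])
    return result

# The first group (norm 5) is shrunk to norm 4; the second (norm 0.5) is discarded.
z = prox_group_l21([[3.0, 4.0], [0.3, 0.4]], 1.0)
```

Each group keeps its direction and only its length changes, so a whole group is either kept (shrunk) or zeroed, which is the group-sparsity effect described in §2.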
Finally, we recall the Moreau decomposition, relating the proximity operators of Fenchel conjugate
functions (Combettes and Wajs, 2006), and present a corollary (proved in Appendix B) that is the
key to our regret bound in §3.3.

Proposition 2 (Moreau (1962)) For any convex, lsc, proper function $\varphi : \mathbb{R}^p \to \bar{\mathbb{R}}$,

$$x = \mathrm{prox}_\varphi(x) + \mathrm{prox}_{\varphi^*}(x) \quad \text{and} \quad \|x\|^2 / 2 = M_\varphi(x) + M_{\varphi^*}(x). \tag{8}$$

Corollary 3 Let $\varphi : \mathbb{R}^p \to \bar{\mathbb{R}}$ be as in Prop. 2, and $\bar{x} = \mathrm{prox}_\varphi(x)$. Then, any $y \in \mathbb{R}^p$ satisfies

$$\|y - \bar{x}\|^2 - \|y - x\|^2 \le 2(\varphi(y) - \varphi(\bar{x})). \tag{9}$$

Although the Fenchel dual $\varphi^*$ does not show up in (9), it has a crucial role in proving Corollary 3.
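The Moreau decomposition can be checked numerically in a simple case. The sketch below takes $\varphi = \tau \|\cdot\|_1$, whose Fenchel conjugate is the indicator of the $\ell_\infty$-ball of radius $\tau$ (a standard fact, not stated in the text above), so that the two proximity operators are soft-thresholding and entrywise clipping. The function names are chosen for this example.

```python
import math

def soft(x, tau):
    """Prox of tau * ||.||_1."""
    return [math.copysign(max(0.0, abs(v) - tau), v) for v in x]

def clip(x, tau):
    """Prox of the conjugate: projection onto the l_inf ball of radius tau."""
    return [max(-tau, min(tau, v)) for v in x]

def moreau_parts(x, tau):
    """Return both proximal points and both Moreau envelopes at x."""
    p, q = soft(x, tau), clip(x, tau)
    env_phi = 0.5 * sum((a - b) ** 2 for a, b in zip(p, x)) + tau * sum(abs(a) for a in p)
    env_conj = 0.5 * sum((a - b) ** 2 for a, b in zip(q, x))  # indicator is 0 inside the ball
    return p, q, env_phi, env_conj

x = [3.0, -0.5, 1.2]
p, q, env_phi, env_conj = moreau_parts(x, 1.0)
# Both identities in the Moreau decomposition should hold up to rounding:
# x = p + q, and ||x||^2 / 2 = env_phi + env_conj.
```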

3.2 A General Online Proximal Algorithm for Composite Regularizers 

The general algorithmic structure that we propose and analyze in this paper, presented as Alg. 1, deals
(in an online¹ fashion) with problems of the form

$$\min_{\theta \in \Theta} \lambda R(\theta) + \frac{1}{m} \sum_{i=1}^m L(\theta; x_i, y_i), \tag{10}$$

where $\Theta \subseteq \mathbb{R}^d$ is convex² and the regularizer $R$ has a composite form $R(\theta) = \sum_{j=1}^J R_j(\theta)$. Like
stochastic gradient descent (SGD (Bottou, 1991)), Alg. 1 is suitable for problems with large $m$; it also
performs (sub-)gradient steps at each round (line 4), but only w.r.t. the loss function $L$. Obtaining
a subgradient typically involves inference using the current model; e.g., loss-augmented inference, if
$L = L_{\mathrm{SVM}}$, or marginal inference if $L = L_{\mathrm{CRF}}$. Our algorithm differs from SGD by the inclusion
of $J$ proximal steps w.r.t. each term $R_j$ (line 7). As noted in (Duchi and Singer, 2009; Langford
et al., 2009), this strategy is more effective than standard SGD for sparsity-inducing regularizers,
due to their usual non-differentiability at the zeros, which causes oscillation and prevents SGD from
returning sparse solutions.

When $J = 1$, Alg. 1 reduces to FOBOS (Duchi and Singer, 2009), which we kernelize and apply
to MKL in §3.4. The case $J > 1$ has applications in variants of MKL or group-LASSO with composite
regularizers (Tomioka and Suzuki, 2010; Friedman et al., 2010; Bach, 2008b; Zhao et al., 2008). In
those cases, the proximity operators of $R_1, \ldots, R_J$ are more easily computed than that of their sum $R$,
making Alg. 1 more suitable than FOBOS. We present a few particular instances (all with $\Theta = \mathbb{R}^d$).

Projected subgradient with groups. Let $J = 1$ and $R$ be the indicator of a convex set $\Theta' \subseteq \mathbb{R}^d$.
Then (see §3.1), each proximal step is the Euclidean projection onto $\Theta'$ and Alg. 1 becomes the online
projected subgradient algorithm from (Zinkevich, 2003). Letting $\Theta' = \{\theta \in \mathbb{R}^d \mid \|\theta\|_{2,1} \le \gamma\}$ yields
an equivalent problem to group-LASSO and MKL (7). Using Prop. 1, each proximal step reduces to
a projection onto an $\ell_1$-ball whose dimension is the number of groups (see a fast algorithm in (Duchi
et al., 2008)).

¹ For simplicity, we focus on the pure online setting, i.e., each parameter update uses a single observation; analogous
algorithms may be derived for the batch and mini-batch cases.

² We are particularly interested in the case where $\Theta$ is a "vacuous" constraint whose goal is to confine each iterate
$\theta_t$ to a region containing the optimum, by virtue of the projection step in line 9. The analysis in §3.3 will make this
clearer. The same trick is used in Pegasos (Shalev-Shwartz et al., 2007).
Algorithm 1 Online Proximal Algorithm

1: input: dataset $\mathcal{D}$, parameter $\lambda$, number of rounds $T$, learning rate sequence $(\eta_t)_{t=1,\ldots,T}$

2: initialize $\theta_1 = 0$; set $m = |\mathcal{D}|$

3: for $t = 1$ to $T$ do

4: take a training pair $(x_t, y_t)$ and obtain a subgradient $g \in \partial L(\theta_t; x_t, y_t)$

5: $\tilde{\theta}_t = \theta_t - \eta_t g$ (gradient step)

6: for $j = 1$ to $J$ do

7: $\tilde{\theta}_{t+j/J} = \mathrm{prox}_{\eta_t \lambda R_j}(\tilde{\theta}_{t+(j-1)/J})$ (proximal step)

8: end for

9: $\theta_{t+1} = \Pi_\Theta(\tilde{\theta}_{t+1})$ (projection step)

10: end for

11: output: the last model $\theta_{T+1}$ or the averaged model $\bar{\theta} = \frac{1}{T} \sum_{t=1}^T \theta_t$
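The control flow of Alg. 1 can be sketched generically in Python, as below. This is a minimal sketch under several assumptions that are not part of the algorithm itself: training pairs are visited cyclically, the proximal steps are passed as functions `prox(theta, step)` that apply the prox of `step * R_j`, and the toy problem (a squared loss with an $\ell_1$ regularizer on two one-hot examples) was chosen only because its solution, $\theta = (1.6, 0)$ for $\lambda = 0.2$, can be verified by hand.

```python
import math

def online_proximal(data, subgradient, prox_steps, project, lam, etas):
    """Sketch of Alg. 1; returns the last model and the averaged model."""
    dim = len(data[0][0])
    theta = [0.0] * dim                                       # line 2
    avg = [0.0] * dim
    T = len(etas)
    for t, eta in enumerate(etas):                            # line 3
        avg = [a + th / T for a, th in zip(avg, theta)]       # running average of theta_t
        x, y = data[t % len(data)]                            # line 4 (cyclic order: an assumption)
        g = subgradient(theta, x, y)
        theta = [th - eta * gi for th, gi in zip(theta, g)]   # line 5: gradient step
        for prox in prox_steps:                               # lines 6-8: J proximal steps
            theta = prox(theta, eta * lam)
        theta = project(theta)                                # line 9: projection step
    return theta, avg

def squared_loss_grad(theta, x, y):
    """Gradient of 0.5 * (<theta, x> - y)^2 (illustrative loss)."""
    r = sum(th * xi for th, xi in zip(theta, x)) - y
    return [r * xi for xi in x]

def soft(theta, tau):
    """Prox of tau * ||.||_1."""
    return [math.copysign(max(0.0, abs(v) - tau), v) for v in theta]

data = [([1.0, 0.0], 2.0), ([0.0, 1.0], 0.1)]
etas = [0.5 / math.sqrt(t) for t in range(1, 2001)]
theta, theta_avg = online_proximal(data, squared_loss_grad, [soft],
                                   lambda th: th, 0.2, etas)
```

In this toy run the second coordinate is never activated, because each gradient step moves it by less than the soft-threshold, which illustrates why the proximal step yields exactly sparse iterates where plain SGD would oscillate around zero.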



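Algorithm 2 Moreau Projection for $\ell_1^2$

1: input: vector $x \in \mathbb{R}^p$ and parameter $\lambda > 0$

2: sort the entries of $|x|$ into $y$ (i.e., such that $y_1 \ge \ldots \ge y_p$)

3: find $\rho = \max \big\{ j \in \{1, \ldots, p\} \mid y_j - \frac{\lambda}{1 + j\lambda} \sum_{r=1}^j y_r > 0 \big\}$

4: output: $z = \mathrm{soft}(x, \tau)$, where $\tau = \frac{\lambda}{1 + \rho\lambda} \sum_{r=1}^\rho y_r$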
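A direct Python transcription of Alg. 2 might look as follows. It is a sketch that assumes the operator being computed is the prox of $(\lambda/2)\|x\|_1^2$; under that reading, the output $z$ should satisfy the optimality condition $\tau = \lambda \|z\|_1$, which the usage example checks on small inputs. The function name `prox_sq_l1` is chosen for this example.

```python
import math

def prox_sq_l1(x, lam):
    """Alg. 2 sketch: prox of (lam/2) * ||x||_1^2 via sorting and one soft-threshold."""
    y = sorted((abs(v) for v in x), reverse=True)     # line 2
    cumsum, rho, rho_sum = 0.0, 0, 0.0
    for j, yj in enumerate(y, start=1):               # line 3: largest j passing the test
        cumsum += yj
        if yj - lam / (1.0 + j * lam) * cumsum > 0:
            rho, rho_sum = j, cumsum
    tau = lam / (1.0 + rho * lam) * rho_sum           # line 4
    return [math.copysign(max(0.0, abs(v) - tau), v) for v in x]

z1 = prox_sq_l1([3.0, 1.0, 0.5], 1.0)   # only the largest entry survives
z2 = prox_sq_l1([2.0, -2.0], 1.0)       # both entries survive, equally shrunk
```

Unlike the plain $\ell_1$ case, the threshold $\tau$ here depends on the whole input through the sorted partial sums, which is how the non-separability of the squared norm shows up.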



Truncated subgradient with groups. Let $J = 1$ and $R(\theta) = \|\theta\|_{2,1}$, so that (10) becomes the
usual formulation of group-LASSO, for a general loss $L$. Then, Alg. 1 becomes a group version of
truncated gradient descent (Langford et al., 2009), studied in (Duchi and Singer, 2009) for multi-task
learning. Similar batch algorithms have also been proposed (Wright et al., 2009). The reduction from
$\ell_{2,1}$ to $\ell_1$ can again be made due to Prop. 1, and each proximal step becomes a simple soft-thresholding
operation (as shown in §3.1).

Proximal subgradient for squared mixed $\ell_{2,1}$. With $R(\theta) = \frac{1}{2} \|\theta\|_{2,1}^2$, we have the MKL problem
(7). Prop. 1 allows reducing each proximal step w.r.t. the squared $\ell_{2,1}$ to one w.r.t. the squared $\ell_1$;
however, unlike in the previous example, squared $\ell_1$ is not separable. This apparent difficulty has
led some authors (e.g., Suzuki and Tomioka (2009)) to remove the square from $R$, which yields the
previous example. However, despite the non-separability of $R$, the proximal steps can still be
efficiently computed: see Alg. 2. This algorithm requires sorting the weights of each group, which has
$O(p \log p)$ cost; we show its correctness in Appendix F. Non-MKL applications of the squared $\ell_{2,1}$
norm are found in (Kowalski and Torrésani, 2009; Zhou et al., 2010).

Other variants of group-LASSO and MKL. In hierarchical LASSO and group-LASSO with
overlaps (Bach, 2008b; Zhao et al., 2008; Jenatton et al., 2009), each feature may appear in more than
one group. Alg. 1 handles these problems by enabling a proximal step for each group. Sparse
group-LASSO (Friedman et al., 2010) simultaneously promotes group-sparsity and sparsity within
each group, by using $R(\theta) = \sigma \|\theta\|_{2,1} + (1 - \sigma) \|\theta\|_1$; Alg. 1 can handle this regularizer by
using two proximal steps, both involving simple soft-thresholding: one at the group level, and another
within each group. In non-sparse MKL ((Kloft et al., 2010), §4.4), $R = \frac{1}{2} \sum_{k=1}^p \|\theta_k\|^q$. Invoking
Prop. 1 and separability, the resulting proximal step amounts to solving $p$ scalar equations of the form
$x - x_0 + \lambda \eta_t q x^{q-1} = 0$, also valid for $q \ge 2$ (unlike the method described in (Kloft et al., 2010)).
3.3 Regret, Convergence, and Generalization Bounds 

We next show that, for a convex loss $L$ and under standard assumptions, Alg. 1 converges up to $\epsilon$
precision, with high confidence, in $O(1/\epsilon^2)$ iterations. If $L$ or $R$ are strongly convex, this bound is
improved to $\tilde{O}(1/\epsilon)$, where $\tilde{O}$ hides logarithmic terms. Our proofs combine tools of online convex
programming (Zinkevich, 2003; Hazan et al., 2007) and classical results about proximity operators
(Moreau, 1962; Combettes and Wajs, 2006). The key is the following lemma (which we prove in
Appendix C).

Lemma 4 Assume that $\forall (x, y) \in \mathcal{U}$, the loss $L(\cdot; x, y)$ is convex and $G$-Lipschitz on $\Theta$, and that
the regularizer $R = R_1 + \ldots + R_J$ satisfies the following conditions: (i) each $R_j$ is convex; (ii)
$\forall \theta \in \Theta$, $\forall j' < j$, $R_{j'}(\theta) \ge R_{j'}(\mathrm{prox}_{\lambda R_j}(\theta))$ (each proximity operator $\mathrm{prox}_{\lambda R_j}$ does not increase
the previous $R_{j'}$); (iii) $R(\theta) \ge R(\Pi_\Theta(\theta))$ (projecting the argument onto $\Theta$ does not increase $R$).
Then, for any $\bar{\theta} \in \Theta$, at each round $t$ of Alg. 1,

$$L(\theta_t) + \lambda R(\theta_{t+1}) \le L(\bar{\theta}) + \lambda R(\bar{\theta}) + \frac{\eta_t}{2} G^2 + \frac{\|\bar{\theta} - \theta_t\|^2 - \|\bar{\theta} - \theta_{t+1}\|^2}{2\eta_t}. \tag{11}$$

If, in addition, $L$ is $\sigma$-strongly convex, then the bound in (11) can be strengthened to

$$L(\theta_t) + \lambda R(\theta_{t+1}) \le L(\bar{\theta}) + \lambda R(\bar{\theta}) + \frac{\eta_t}{2} G^2 + \frac{\|\bar{\theta} - \theta_t\|^2 - \|\bar{\theta} - \theta_{t+1}\|^2}{2\eta_t} - \frac{\sigma}{2} \|\bar{\theta} - \theta_t\|^2. \tag{12}$$

A related, but less tight, bound for $J = 1$ was derived in Duchi and Singer (2009); instead of
our term $\frac{\eta_t}{2} G^2$ in (11), the bound of (Duchi and Singer, 2009) has $\frac{7\eta_t}{2} G^2$.³ When $R = \|\cdot\|_1$, FOBOS
becomes the truncated gradient algorithm of Langford et al. (2009) and our bound matches the one
derived therein, closing the gap between (Duchi and Singer, 2009) and (Langford et al., 2009). The
classical result in Prop. 2, relating Moreau projections and Fenchel duality, is the crux of our bound,
via Corollary 3. Finally, note that the conditions (i)-(iii) are not restrictive: they hold whenever the
proximity operators are shrinkage functions (e.g., if $R_j = \|\theta\|_{p_j}^{q_j}$, with $p_j, q_j \ge 1$).

We next characterize Alg. 1 in terms of its cumulative regret w.r.t. the best fixed hypothesis, i.e.,

$$\mathrm{Reg}_T = \sum_{t=1}^T \big( \lambda R(\theta_t) + L(\theta_t; x_t, y_t) \big) - \min_{\theta \in \Theta} \sum_{t=1}^T \big( \lambda R(\theta) + L(\theta; x_t, y_t) \big). \tag{13}$$

Proposition 5 (regret bounds with fixed and decaying learning rates) Assume the conditions of Lemma 4,
along with $R \ge 0$ and $R(0) = 0$. Then:

1. Running Alg. 1 with fixed learning rate $\eta$ yields

$$\mathrm{Reg}_T \le \frac{\eta T}{2} G^2 + \frac{\|\theta^*\|^2}{2\eta}, \quad \text{where} \quad \theta^* = \arg\min_{\theta \in \Theta} \sum_{t=1}^T \big( \lambda R(\theta) + L(\theta; x_t, y_t) \big). \tag{14}$$

Setting $\eta = \|\theta^*\| / (G\sqrt{T})$ yields a sublinear regret of $\|\theta^*\| G \sqrt{T}$. (Note that this requires knowing
in advance $\|\theta^*\|$ and the number of rounds $T$.)

³ This can be seen from their Eq. 9, setting $A = 0$ and $\eta_t = \eta_{t+1/2}$.
2. Assume that $\Theta$ is bounded with diameter $F$ (i.e., $\forall \theta, \theta' \in \Theta$, $\|\theta - \theta'\| \le F$). Let the learning rate
be $\eta_t = \eta_0 / \sqrt{t}$, with arbitrary $\eta_0 > 0$. Then,

$$\mathrm{Reg}_T \le \Big( \frac{F^2}{2\eta_0} + G^2 \eta_0 \Big) \sqrt{T}. \tag{15}$$

Optimizing the bound gives $\eta_0 = F / (\sqrt{2} G)$, yielding $\mathrm{Reg}_T \le FG\sqrt{2T}$.

3. If $L$ is $\sigma$-strongly convex, and $\eta_t = 1/(\sigma t)$, we obtain a logarithmic regret bound:

$$\mathrm{Reg}_T \le G^2 (1 + \log T) / (2\sigma). \tag{16}$$
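As a short worked check of the step-size choice in item 1, and assuming the fixed-rate bound reads $\mathrm{Reg}_T \le \frac{\eta T}{2} G^2 + \frac{\|\theta^*\|^2}{2\eta}$, the stated rate and regret follow by minimizing the right-hand side over $\eta$:

```latex
\begin{aligned}
b(\eta) &= \frac{\eta T}{2} G^2 + \frac{\|\theta^*\|^2}{2\eta},
\qquad
b'(\eta) = \frac{T G^2}{2} - \frac{\|\theta^*\|^2}{2\eta^2} = 0
\;\Longrightarrow\;
\eta = \frac{\|\theta^*\|}{G\sqrt{T}}, \\
b\!\left(\frac{\|\theta^*\|}{G\sqrt{T}}\right)
&= \frac{\|\theta^*\| G \sqrt{T}}{2} + \frac{\|\theta^*\| G \sqrt{T}}{2}
= \|\theta^*\| G \sqrt{T}.
\end{aligned}
```

Since $b$ is convex in $\eta > 0$, this stationary point is the minimizer; the two terms of the bound are balanced at the optimum.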

Similarly to other analyses of online learning algorithms, once an online-to-batch conversion is
specified, regret bounds allow us to obtain PAC bounds on optimization and generalization errors.
The following proposition can be proved using the same techniques as in (Cesa-Bianchi et al., 2004;
Shalev-Shwartz et al., 2007).

Proposition 6 (optimization and estimation error) If the assumptions of Prop. 5 hold and $\eta_t =
\eta_0 / \sqrt{t}$ as in 2., then the version of Alg. 1 that returns the averaged model solves the optimization
problem (10) with accuracy $\epsilon$ in $T = O((F^2 G^2 + \log(1/\delta)) / \epsilon^2)$ iterations, with probability at least
$1 - \delta$. If $L$ is also $\sigma$-strongly convex and $\eta_t = 1/(\sigma t)$ as in 3., then, for the version of Alg. 1 that
returns $\theta_{T+1}$, we get $T = \tilde{O}(G^2 / (\sigma \delta \epsilon))$. The generalization bounds are of the same orders.

We now pause to see how the analysis applies to some concrete cases. The requirement that the loss
is $G$-Lipschitz holds for the hinge and logistic losses, where $G = 2 \max_{u \in \mathcal{U}} \|\phi(u)\|$ (see Appendix E).
These losses are not strongly convex, and therefore Alg. 1 has only $O(1/\epsilon^2)$ convergence. If the
regularizer $R$ is $\sigma$-strongly convex, a possible workaround to obtain $\tilde{O}(1/\epsilon)$ convergence is to let $L$
"absorb" that strong convexity by redefining $\tilde{L}(\theta; x_t, y_t) = L(\theta; x_t, y_t) + \sigma \|\theta\|^2 / 2$. Since neither
the $\ell_{2,1}$-norm nor its square are strongly convex, we cannot use this trick for the MKL case (7), but
it does apply for non-sparse MKL (Kloft et al., 2010) ($\ell_{2,q}$-norms are strongly convex for $q > 1$)
and for elastic net MKL (Suzuki and Tomioka, 2009). Still, the $O(1/\epsilon^2)$ rate for MKL is competitive
with the best batch algorithms; e.g., the method in Xu et al. (2009) achieves $\epsilon$ primal-dual gap in
$O(1/\epsilon^2)$ iterations. Some losses of interest (e.g., the squared loss, or the modified loss $\tilde{L}$ above) are
$G$-Lipschitz in any compact subset of $\mathbb{R}^d$ but not in $\mathbb{R}^d$. However, if it is known in advance that the
optimal solution must lie in some compact convex set $\Theta$, we can add a vacuous constraint and run
Alg. 1 with the projection step, making the analysis still applicable; we present concrete examples in
Appendix E.



3.4 Online MKL 

The instantiation of Alg. 1 for $R(\theta) = \frac{1}{2} \|\theta\|_{2,1}^2$ yields Alg. 3. We consider $L = L_{\mathrm{SVM}}$; adapting to
any generalized linear model (e.g., $L = L_{\mathrm{CRF}}$) is straightforward. As discussed in the last paragraph
of §3.3, it may be necessary to consider "vacuous" projection steps to ensure fast convergence. Hence,
an optional upper bound $\gamma$ on $\|\theta\|$ is accepted as input. Suitable values of $\gamma$ for the SVM and CRF
case are given in Appendix E. In line 4, the scores of candidate outputs are computed groupwise;
in structured prediction (see §2), a factorization over parts is assumed and the scores are for partial
output assignments (see Taskar et al. (2003); Tsochantaridis et al. (2004) for details). The key novelty
of Alg. 3 is in line 8, where the group structure is taken into account, by applying a proximity operator
which corresponds to a groupwise shrinkage/thresholding, where some groups may be set to zero.
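Algorithm 3 Online-MKL

1: input: $\mathcal{D}$, $\lambda$, $T$, radius $\gamma$, learning rate sequence $(\eta_t)_{t=1,\ldots,T}$

2: initialize $\theta^1 \leftarrow 0$

3: for $t = 1$ to $T$ do

4: take an instance $x_t, y_t$ and compute scores $f_k(x_t, y'_t) = \langle \theta_k^t, \phi_k(x_t, y'_t) \rangle$, for $k = 1, \ldots, p$

5: decode: $\hat{y}_t \in \arg\max_{y'_t \in \mathcal{Y}(x_t)} \sum_{k=1}^p f_k(x_t, y'_t) + \ell(y'_t, y_t)$

6: Gradient step: $\tilde{\theta}_k^t = \theta_k^t - \eta_t (\phi_k(x_t, \hat{y}_t) - \phi_k(x_t, y_t))$

7: compute weights $\tilde{b}_k^t = \|\tilde{\theta}_k^t\|$, $k = 1, \ldots, p$, and shrink them $b^t = \mathrm{prox}_{\frac{\eta_t \lambda}{2} \|\cdot\|_1^2}(\tilde{b}^t)$ with Alg. 2

8: Proximal step: $\tilde{\theta}_k^{t+1} = (b_k^t / \tilde{b}_k^t) \cdot \tilde{\theta}_k^t$, for $k = 1, \ldots, p$

9: Projection step: $\theta^{t+1} = \tilde{\theta}^{t+1} \cdot \min\{1, \gamma / \|\tilde{\theta}^{t+1}\|\}$

10: end for

11: compute $\beta_k = \|\theta_k^{T+1}\| / \sum_{l=1}^p \|\theta_l^{T+1}\|$, for $k = 1, \ldots, p$

12: return $\beta$, and the last model $\theta^{T+1}$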
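For intuition, the parametric form of Alg. 3 can be sketched in Python for the simplest "structured" case, multiclass classification with explicit features. Everything specific here is an assumption made for illustration: the joint feature map places the features of group $k$ in the block of the candidate label, the cost $\ell$ is 0/1, decoding is done by enumeration, examples are visited cyclically, and Alg. 2 is read as the prox of $(\lambda/2)\|\cdot\|_1^2$. Function names are hypothetical.

```python
import math

def prox_sq_l1(b, lam):
    """Alg. 2 for a nonnegative vector b: prox of (lam/2) * ||b||_1^2."""
    y = sorted(b, reverse=True)
    cumsum, rho, rho_sum = 0.0, 0, 0.0
    for j, yj in enumerate(y, start=1):
        cumsum += yj
        if yj - lam / (1.0 + j * lam) * cumsum > 0:
            rho, rho_sum = j, cumsum
    tau = lam / (1.0 + rho * lam) * rho_sum
    return [max(0.0, v - tau) for v in b]

def group_norm(blocks):
    """Euclidean norm of one group, stored as one weight row per label."""
    return math.sqrt(sum(w * w for row in blocks for w in row))

def online_mkl(data, groups, n_labels, lam, etas, gamma=None):
    """Sketch of Alg. 3; `groups` lists the feature indices of each group."""
    p = len(groups)
    theta = [[[0.0] * len(g) for _ in range(n_labels)] for g in groups]
    for t, eta in enumerate(etas):
        x, y = data[t % len(data)]
        xs = [[x[i] for i in g] for g in groups]

        def augmented_score(label):                    # lines 4-5: score plus 0/1 cost
            s = sum(w * v for k in range(p) for w, v in zip(theta[k][label], xs[k]))
            return s + (0.0 if label == y else 1.0)

        y_hat = max(range(n_labels), key=augmented_score)
        if y_hat != y:                                 # line 6: gradient step
            for k in range(p):
                for i, v in enumerate(xs[k]):
                    theta[k][y_hat][i] -= eta * v
                    theta[k][y][i] += eta * v
        b_tilde = [group_norm(theta[k]) for k in range(p)]      # line 7
        b = prox_sq_l1(b_tilde, eta * lam)
        scales = [bk / bt if bt > 0 else 0.0 for bk, bt in zip(b, b_tilde)]
        if gamma is not None:                          # line 9: optional projection
            total = math.sqrt(sum(bk * bk for bk in b))
            if total > gamma:
                scales = [s * gamma / total for s in scales]
        for k in range(p):                             # line 8: rescale each group
            theta[k] = [[scales[k] * w for w in row] for row in theta[k]]
    norms = [group_norm(theta[k]) for k in range(p)]   # line 11: kernel weights
    z = sum(norms)
    beta = [n / z if z > 0 else 0.0 for n in norms]
    return beta, theta

# Group 0 carries the signal; group 1 is an always-zero (irrelevant) feature.
data = [([1.0, 0.0], 0), ([-1.0, 0.0], 1)]
etas = [0.5 / math.sqrt(t) for t in range(1, 51)]
beta, theta = online_mkl(data, [[0], [1]], 2, 0.1, etas)
```

In this toy run the irrelevant group receives no gradient and so keeps zero weight, and all of the kernel weight $\beta$ goes to the informative group.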



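Although Alg. 3 is written in parametric form, it can be kernelized, as shown next (one can also
use explicit features in some groups, and implicit in others). Observe that the parameters of the $k$th
group after round $t$ can be written as $\theta_k^{t+1} = \sum_{s=1}^t \alpha_{ks}^{t+1} (\phi_k(x_s, y_s) - \phi_k(x_s, \hat{y}_s))$, where

$$\alpha_{ks}^{t+1} = \eta_s \prod_{r=s}^t \Big( (b_k^r / \tilde{b}_k^r) \min\{1, \gamma / \|\tilde{\theta}^{r+1}\|\} \Big) = \begin{cases} \eta_t (b_k^t / \tilde{b}_k^t) \min\{1, \gamma / \|\tilde{\theta}^{t+1}\|\} & \text{if } s = t \\ \alpha_{ks}^t (b_k^t / \tilde{b}_k^t) \min\{1, \gamma / \|\tilde{\theta}^{t+1}\|\} & \text{if } s < t. \end{cases}$$

Therefore, the inner products in line 4 can be kernelized. The cost of this step is $O(\min\{m, t\})$,
instead of the $O(d_k)$ (where $d_k$ is the dimension of the $k$th group) for the explicit feature case. After
the decoding step (line 5), the supporting pair $(x_t, \hat{y}_t)$ is stored. Lines 7, 9 and 11 require the norm
of each group, which can be manipulated using kernels: indeed, after each gradient step (line 6), we
have (denoting $u_t = (x_t, y_t)$ and $\hat{u}_t = (x_t, \hat{y}_t)$):

$$\begin{aligned} \|\tilde{\theta}_k^t\|^2 &= \|\theta_k^t\|^2 - 2\eta_t \langle \theta_k^t, \phi_k(x_t, \hat{y}_t) - \phi_k(x_t, y_t) \rangle + \eta_t^2 \|\phi_k(x_t, \hat{y}_t) - \phi_k(x_t, y_t)\|^2 \\ &= \|\theta_k^t\|^2 - 2\eta_t \big( f_k(\hat{u}_t) - f_k(u_t) \big) + \eta_t^2 \big( K_k(\hat{u}_t, \hat{u}_t) + K_k(u_t, u_t) - 2K_k(\hat{u}_t, u_t) \big), \end{aligned} \tag{17}$$

and the proximal and projection steps merely scale these norms. When the algorithm terminates, it
returns the kernel weights $\beta$ and the sequence $(\alpha_{ks}^{T+1})$.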

In case of sparse explicit features, an implementation trick analogous to the one used in
(Shalev-Shwartz et al., 2007) (where each $\theta_k$ is represented by its norm and an unnormalized vector) can
substantially reduce the amount of computation. In the case of implicit features with a sparse kernel
matrix, a sparse storage of this matrix can also significantly speed up the algorithm, eliminating its 
dependency on m in line 4. Note also that all steps involving group-specific computation can be 
carried out in parallel using multiple machines, which makes Alg. 3 suitable for combining many 
kernels (large p). 

4 Experiments 

Handwriting recognition. We use the OCR dataset of Taskar et al. (2003) (www.cis.upenn.edu/~taskar/ocr),
which has 6877 words written by 150 people (52152 characters). Each character is a
Kernel                                      Training runtime    Test acc. (per char.)
Linear (L)                                  6 sec.              72.8 ± 4.4%
Quadratic (Q)                               116 sec.            85.5 ± 0.3%
Gaussian (G) ($\sigma^2 = 5$)                    123 sec.            84.1 ± 0.4%
Average (L + Q + G)/3                       118 sec.            84.3 ± 0.3%
MKL $\beta_1 L + \beta_2 Q + \beta_3 G$                    279 sec.            87.5 ± 0.4%
$B_1$-Spline ($B_1$)                            8 sec.              75.4 ± 0.9%
Average (L + $B_1$)/2                         15 sec.             83.0 ± 0.3%
MKL $\beta_1 L + \beta_2 B_1$                         15 sec.             85.2 ± 0.3%

Table 1: Results for handwriting recognition. Averages over 10 runs on the same folds as in (Taskar
et al., 2003), training on one and testing on the others. The linear and quadratic kernels are normalized
to unit diagonal. In all cases, 20 epochs were used, with $\eta_0$ in (15) picked from {0.01, 0.1, 1, 10} by
selecting the one that most decreases the objective after 5 epochs. Results are for the best regularization
coefficient $C = 1/(\lambda m)$ (chosen from $\{0.1, 1, 10, 10^2, 10^3, 10^4\}$).

16-by-8 binary image, i.e., a 128-dimensional vector (our input) and has one of 26 labels (a-z; the 
outputs to predict). Like in (Taskar et al., 2003), we address this sequence labeling problem with a 
structured SVM; however, we learn the kernel from the data, via Alg. 3. We use an indicator basis 
function to represent the correlation between consecutive outputs. Our first experiment (reported in 
the upper part of Tab. 1) compares linear, quadratic, and Gaussian kernels, either used individually, 
combined via a simple average, or with MKL. The results show that MKL outperforms the others by 
2% or more. 

The second experiment aims at showing the ability of Alg. 3 to exploit both feature and kernel
sparsity by learning a combination of a linear kernel (explicit features) with a generalized $B_1$-spline
kernel, given by $K(x, x') = \max\{0, 1 - \|x - x'\| / h\}$, with $h$ chosen so that the kernel matrix has
~95% zeros. The rationale is to combine the strength of a simple feature-based kernel with that of
one depending only on a few nearest neighbors. The results (Tab. 1, bottom part) show that MKL
outperforms the individual kernels by ~10%, and the averaged kernel by more than 2%. Perhaps more
importantly, the accuracy is not much worse than the best one obtained in the previous experiment,
while the runtime is much faster (15 versus 279 seconds).
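The generalized $B_1$-spline kernel used in this experiment is simple to write down. The sketch below implements the formula above and measures the fraction of zero entries in the resulting kernel matrix, which is the quantity that drives the choice of $h$; the function names and the tiny point set are assumptions made for this example, not the experimental setup.

```python
import math

def b1_spline_kernel(x, x_prime, h):
    """K(x, x') = max{0, 1 - ||x - x'|| / h}: zero beyond distance h."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_prime)))
    return max(0.0, 1.0 - dist / h)

def zero_fraction(points, h):
    """Fraction of zero entries in the kernel matrix of `points`."""
    n = len(points)
    zeros = sum(1 for a in points for b in points if b1_spline_kernel(a, b, h) == 0.0)
    return zeros / (n * n)

points = [[0.0, 0.0], [3.0, 4.0], [100.0, 0.0]]
sparsity = zero_fraction(points, 10.0)  # only the two nearby points interact
```

Decreasing $h$ makes the matrix sparser, since each point then interacts with fewer neighbors; one could tune $h$ on a sample until `zero_fraction` reaches the desired level.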

Dependency parsing. We trained non-projective dependency parsers for English, using the dataset
from the CoNLL-2008 shared task (Surdeanu et al., 2008) (39278 training sentences, ~$10^6$ tokens,
and 2399 test sentences). The output to be predicted from each input sentence is the set of dependency
arcs, linking heads to modifiers, that must define a spanning tree (see example in Fig. 1). We use
arc-factored models, where the feature vectors decompose as $\phi(x, y) = \sum_{(h,m) \in y} \phi_{h,m}(x)$. Although
they are not the state-of-the-art for this task, exact inference is tractable via minimum spanning tree
algorithms (McDonald et al., 2005). We defined 507 feature templates for each candidate arc by
conjoining the words, lemmas, and parts-of-speech of the head $h$ and the modifier $m$, as well as the
parts-of-speech of the surrounding words, and the distance and direction of attachment. This yields a
large scale problem, with > 50 million features instantiated. The feature vectors associated with each
candidate arc, however, are very sparse and this is exploited in the implementation. We ran Alg. 3
with explicit features, with each group standing for a feature template. MKL did not outperform a
standard SVM in this experiment (90.67% against 90.92%); however, it showed a good performance
Figure 1: Top: a dependency parse tree for the sentence "John hit the ball with the bat", rooted at
"root" (adapted from (McDonald et al., 2005)). Bottom left: group weights along the epochs of Alg. 3.
Bottom right: results of standard SVMs trained on sets of feature templates of sizes
{107, 207, 307, 407, 507}, either selected via a standard SVM or by MKL (the UAS, unlabeled
attachment score, is the fraction of non-punctuation words whose head was correctly assigned).

at pruning irrelevant feature templates (see Fig. 1, bottom right). Besides interpretability, which may
be useful for the understanding of the syntax of natural languages, this pruning is also appealing in 
a two-stage architecture, where a standard learner at a second stage will only need to handle a small 
fraction of the templates initially hypothesized. 

5 Conclusions 

We introduced a new class of online proximal algorithms that extends FOBOS and is applicable 
to many variants of MKL and group-LASSO. We provided regret, convergence, and generalization 
bounds, and used the algorithm for learning the kernel in large-scale structured prediction tasks. 

Our work may impact other problems. In structured prediction, the ability to promote structural 
sparsity suggests that it is possible to learn simultaneously the structure and the parameters of the 
graphical models. The ability to learn the kernel online offers a new paradigm for problems in which 
the underlying geometry (induced by the similarities between objects) evolves over time: algorithms 
that adapt the kernel while learning are robust to certain kinds of concept drift. We plan to explore 
these directions in future work. 
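A Proof of Proposition 1

We have respectively:

$$
\begin{aligned}
M_\varphi(\mathbf{x}_1,\ldots,\mathbf{x}_p)
&= \min_{\mathbf{y}} \tfrac{1}{2}\|\mathbf{y}-\mathbf{x}\|^2 + \varphi(\mathbf{y}) \\
&= \min_{\mathbf{y}_1,\ldots,\mathbf{y}_p} \tfrac{1}{2}\sum_{k=1}^{p}\|\mathbf{y}_k-\mathbf{x}_k\|^2 + \psi(\|\mathbf{y}_1\|,\ldots,\|\mathbf{y}_p\|) \\
&= \min_{\mathbf{u}\in\mathbb{R}^p_+} \psi(u_1,\ldots,u_p) + \min_{\mathbf{y}:\,\|\mathbf{y}_k\|=u_k,\,\forall k} \tfrac{1}{2}\sum_{k=1}^{p}\|\mathbf{y}_k-\mathbf{x}_k\|^2 \\
&= \min_{\mathbf{u}\in\mathbb{R}^p_+} \psi(u_1,\ldots,u_p) + \tfrac{1}{2}\sum_{k=1}^{p}\,\min_{\mathbf{y}_k:\,\|\mathbf{y}_k\|=u_k}\|\mathbf{y}_k-\mathbf{x}_k\|^2 \qquad (*) \\
&= \min_{\mathbf{u}\in\mathbb{R}^p_+} \psi(u_1,\ldots,u_p) + \tfrac{1}{2}\sum_{k=1}^{p}\bigl(u_k-\|\mathbf{x}_k\|\bigr)^2 \\
&= M_\psi(\|\mathbf{x}_1\|,\ldots,\|\mathbf{x}_p\|),
\end{aligned}
$$

where the solution of the innermost minimization problem in (*) is $\mathbf{y}_k = \frac{\mathbf{x}_k}{\|\mathbf{x}_k\|}\,u_k$, and therefore

$$
[\mathrm{prox}_\varphi(\mathbf{x}_1,\ldots,\mathbf{x}_p)]_k = [\mathrm{prox}_\psi(\|\mathbf{x}_1\|,\ldots,\|\mathbf{x}_p\|)]_k\,\frac{\mathbf{x}_k}{\|\mathbf{x}_k\|}. \tag{18}
$$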
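As an illustration of the reduction above, the following minimal sketch computes a proximity operator on the block norms and then rescales each block along its own direction. It assumes, purely for illustration, that the function of the norms is $\tau$ times their sum (the group-LASSO case), whose prox on non-negative inputs is soft-thresholding; the helper names `group_prox` and `soft_threshold` are hypothetical and do not come from the paper.

```python
import math

def soft_threshold(values, tau):
    """Prox of tau * sum(values) on non-negative inputs: max(v - tau, 0)."""
    return [max(v - tau, 0.0) for v in values]

def group_prox(blocks, prox_on_norms):
    """Apply a prox defined on the vector of block norms, then rescale each block."""
    norms = [math.sqrt(sum(v * v for v in block)) for block in blocks]
    new_norms = prox_on_norms(norms)
    result = []
    for block, norm, new_norm in zip(blocks, norms, new_norms):
        # A zero block stays zero; otherwise keep the direction x_k / ||x_k||.
        scale = new_norm / norm if norm > 0.0 else 0.0
        result.append([scale * v for v in block])
    return result

# Two blocks with norms 5 and 0.5; a threshold of 1 shrinks the first and zeroes the second.
blocks = [[3.0, 4.0], [0.3, 0.4]]
shrunk = group_prox(blocks, lambda norms: soft_threshold(norms, 1.0))
```

Only the norms enter the inner prox, so the cost beyond computing the norms does not depend on the block sizes.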



B Proof of Corollary 3 

We start by stating and proving the following lemma: 

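Lemma 7 Let $\varphi : \mathbb{R}^p \to \bar{\mathbb{R}}$ be as in Prop. 2, and let $\bar{\mathbf{x}} = \mathrm{prox}_\varphi(\mathbf{x})$. Then, any $\mathbf{y} \in \mathbb{R}^p$ satisfies

$$
(\bar{\mathbf{x}} - \mathbf{y})^\top(\bar{\mathbf{x}} - \mathbf{x}) \le \varphi(\mathbf{y}) - \varphi(\bar{\mathbf{x}}). \tag{19}
$$

Proof: From (8), we have that

$$
\begin{aligned}
\tfrac{1}{2}\|\mathbf{x}\|^2
&= \tfrac{1}{2}\|\bar{\mathbf{x}} - \mathbf{x}\|^2 + \varphi(\bar{\mathbf{x}}) + \tfrac{1}{2}\|\bar{\mathbf{x}}\|^2 + \varphi^*(\mathbf{x} - \bar{\mathbf{x}}) \\
&= \tfrac{1}{2}\|\bar{\mathbf{x}} - \mathbf{x}\|^2 + \varphi(\bar{\mathbf{x}}) + \tfrac{1}{2}\|\bar{\mathbf{x}}\|^2 + \sup_{\mathbf{u}\in\mathbb{R}^p}\bigl(\mathbf{u}^\top(\mathbf{x} - \bar{\mathbf{x}}) - \varphi(\mathbf{u})\bigr) \\
&\ge \tfrac{1}{2}\|\bar{\mathbf{x}} - \mathbf{x}\|^2 + \varphi(\bar{\mathbf{x}}) + \tfrac{1}{2}\|\bar{\mathbf{x}}\|^2 + \mathbf{y}^\top(\mathbf{x} - \bar{\mathbf{x}}) - \varphi(\mathbf{y}) \\
&= \tfrac{1}{2}\|\mathbf{x}\|^2 + \bar{\mathbf{x}}^\top(\bar{\mathbf{x}} - \mathbf{x}) + \mathbf{y}^\top(\mathbf{x} - \bar{\mathbf{x}}) - \varphi(\mathbf{y}) + \varphi(\bar{\mathbf{x}}) \\
&= \tfrac{1}{2}\|\mathbf{x}\|^2 + (\bar{\mathbf{x}} - \mathbf{y})^\top(\bar{\mathbf{x}} - \mathbf{x}) - \varphi(\mathbf{y}) + \varphi(\bar{\mathbf{x}}),
\end{aligned}
$$

from which (19) follows.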
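Now, take Lemma 7 and bound the left hand side as:

$$
\begin{aligned}
(\bar{\mathbf{x}} - \mathbf{y})^\top(\bar{\mathbf{x}} - \mathbf{x})
&\ge (\bar{\mathbf{x}} - \mathbf{y})^\top(\bar{\mathbf{x}} - \mathbf{x}) - \tfrac{1}{2}\|\bar{\mathbf{x}} - \mathbf{x}\|^2 \\
&= (\bar{\mathbf{x}} - \mathbf{y})^\top(\bar{\mathbf{x}} - \mathbf{x}) - \tfrac{1}{2}\|\bar{\mathbf{x}}\|^2 - \tfrac{1}{2}\|\mathbf{x}\|^2 + \bar{\mathbf{x}}^\top\mathbf{x} \\
&= \tfrac{1}{2}\|\bar{\mathbf{x}}\|^2 - \mathbf{y}^\top(\bar{\mathbf{x}} - \mathbf{x}) - \tfrac{1}{2}\|\mathbf{x}\|^2 \\
&= \tfrac{1}{2}\|\mathbf{y} - \bar{\mathbf{x}}\|^2 - \tfrac{1}{2}\|\mathbf{y} - \mathbf{x}\|^2.
\end{aligned}
$$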



This concludes the proof. 
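The two inequalities of this appendix can be checked numerically. The sketch below assumes, for illustration only, the regularizer $\varphi = \tau\|\cdot\|_1$ (whose prox is soft-thresholding) and random test points; it evaluates the slack in Lemma 7 and in the resulting bound $\|\mathbf{y}-\bar{\mathbf{x}}\|^2 - \|\mathbf{y}-\mathbf{x}\|^2 \le 2(\varphi(\mathbf{y}) - \varphi(\bar{\mathbf{x}}))$. All function names are hypothetical.

```python
import math
import random

def prox_l1(x, tau):
    """Soft-thresholding: the prox of tau * ||.||_1."""
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in x]

def phi(x, tau):
    return tau * sum(abs(v) for v in x)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sub(a, b):
    return [p - q for p, q in zip(a, b)]

def lemma7_gap(x, y, tau):
    """phi(y) - phi(x_bar) - (x_bar - y)^T (x_bar - x); should be >= 0."""
    x_bar = prox_l1(x, tau)
    return phi(y, tau) - phi(x_bar, tau) - dot(sub(x_bar, y), sub(x_bar, x))

def corollary_gap(x, y, tau):
    """2 (phi(y) - phi(x_bar)) - (||y - x_bar||^2 - ||y - x||^2); should be >= 0."""
    x_bar = prox_l1(x, tau)
    lhs = dot(sub(y, x_bar), sub(y, x_bar)) - dot(sub(y, x), sub(y, x))
    return 2.0 * (phi(y, tau) - phi(x_bar, tau)) - lhs

rng = random.Random(0)  # fixed seed so the check is reproducible
gaps = []
for _ in range(1000):
    x = [rng.uniform(-3.0, 3.0) for _ in range(4)]
    y = [rng.uniform(-3.0, 3.0) for _ in range(4)]
    gaps.append(min(lemma7_gap(x, y, 0.7), corollary_gap(x, y, 0.7)))
worst_gap = min(gaps)  # non-negative up to rounding error
```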



C Proof of Lemma 4 

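Let $u(\bar\theta, \theta) \triangleq \lambda R(\bar\theta) - \lambda R(\theta)$. We have successively:

$$
\begin{aligned}
\|\bar\theta - \theta_{t+1}\|^2
&\overset{(i)}{\le} \|\bar\theta - \tilde\theta_{t+1}\|^2 \\
&\overset{(ii)}{\le} \|\bar\theta - \tilde\theta_{t}\|^2 + 2\eta_t\lambda\sum_{j=1}^{J}\bigl(R_j(\bar\theta) - R_j(\tilde\theta_{t+j/J})\bigr) \\
&\overset{(iii)}{\le} \|\bar\theta - \tilde\theta_{t}\|^2 + 2\eta_t\,u(\bar\theta, \tilde\theta_{t+1}) \\
&\overset{(iv)}{\le} \|\bar\theta - \tilde\theta_{t}\|^2 + 2\eta_t\,u(\bar\theta, \theta_{t+1}) \\
&= \|\bar\theta - \theta_{t}\|^2 + \|\theta_t - \tilde\theta_{t}\|^2 + 2(\bar\theta - \theta_t)^\top(\theta_t - \tilde\theta_t) + 2\eta_t\,u(\bar\theta, \theta_{t+1}) \\
&= \|\bar\theta - \theta_{t}\|^2 + \eta_t^2\|\mathbf{g}\|^2 + 2\eta_t(\bar\theta - \theta_t)^\top\mathbf{g} + 2\eta_t\,u(\bar\theta, \theta_{t+1}) \\
&\overset{(v)}{\le} \|\bar\theta - \theta_{t}\|^2 + \eta_t^2\|\mathbf{g}\|^2 + 2\eta_t\bigl(L(\bar\theta) - L(\theta_t)\bigr) + 2\eta_t\,u(\bar\theta, \theta_{t+1}) \\
&\le \|\bar\theta - \theta_{t}\|^2 + \eta_t^2 G^2 + 2\eta_t\bigl(L(\bar\theta) - L(\theta_t)\bigr) + 2\eta_t\,u(\bar\theta, \theta_{t+1}),
\end{aligned} \tag{20}
$$

where the inequality (i) is due to the nonexpansiveness of the projection operator, (ii) follows from
applying Corollary 3 $J$ times, (iii) follows from applying the inequality $R_j(\tilde\theta_{t+l/J}) \ge R_j(\tilde\theta_{t+(l+1)/J})$
for $l = j, \ldots, J-1$, (iv) results from the fact that $R(\tilde\theta_{t+1}) \ge R(\Pi_\Theta(\tilde\theta_{t+1}))$, and (v) results from
the subgradient inequality of convex functions, which has an extra term $-\frac{\sigma}{2}\|\bar\theta - \theta_t\|^2$ if $L$ is $\sigma$-strongly
convex.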
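The three operations that the proof chains together (a subgradient step, proximal steps, and a projection) can be sketched as a single round. This is a minimal sketch under stated assumptions, not the paper's implementation: it uses one proximal step ($J = 1$) with an $\ell_1$ regularizer and an $\ell_2$-ball constraint set, both chosen only for illustration, and all function names are hypothetical. The final comparison mirrors step (i), the nonexpansiveness of the projection for a comparator inside the constraint set.

```python
import math

def soft_threshold(theta, tau):
    """Prox of tau * ||.||_1 (illustrative regularizer)."""
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in theta]

def project_l2_ball(theta, gamma):
    """Euclidean projection onto {theta : ||theta|| <= gamma}."""
    norm = math.sqrt(sum(v * v for v in theta))
    return list(theta) if norm <= gamma else [gamma * v / norm for v in theta]

def online_proximal_round(theta, grad, eta, lam, gamma):
    """One round: subgradient step, proximal step, projection."""
    theta_tilde = [t - eta * g for t, g in zip(theta, grad)]  # gradient step
    theta_tilde = soft_threshold(theta_tilde, eta * lam)       # proximal step (J = 1)
    return theta_tilde, project_l2_ball(theta_tilde, gamma)    # projection step

def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

theta_t = [0.5, -0.2, 0.0]
grad = [-4.0, 1.0, 0.1]
theta_tilde_next, theta_next = online_proximal_round(
    theta_t, grad, eta=0.5, lam=0.4, gamma=1.0)

# Step (i): projecting cannot move the iterate away from a comparator inside the ball.
comparator = [0.3, 0.1, -0.2]
step_i_holds = dist(comparator, theta_next) <= dist(comparator, theta_tilde_next)
```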
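D Proof of Proposition 5

Invoke Lemma 4 and sum for $t = 1, \ldots, T$, which gives

$$
\begin{aligned}
\sum_{t=1}^{T}\bigl(L(\theta_t; x_t, y_t) + \lambda R(\theta_t)\bigr)
&\overset{(i)}{\le} \sum_{t=1}^{T}\bigl(L(\theta_t; x_t, y_t) + \lambda R(\theta_{t+1})\bigr) \\
&\le \sum_{t=1}^{T}\bigl(L(\theta^*; x_t, y_t) + \lambda R(\theta^*)\bigr) + \frac{G^2}{2}\sum_{t=1}^{T}\eta_t + \sum_{t=1}^{T}\frac{\|\theta^* - \theta_t\|^2 - \|\theta^* - \theta_{t+1}\|^2}{2\eta_t} \\
&= \sum_{t=1}^{T}\bigl(L(\theta^*; x_t, y_t) + \lambda R(\theta^*)\bigr) + \frac{G^2}{2}\sum_{t=1}^{T}\eta_t + \frac{1}{2}\sum_{t=2}^{T}\Bigl(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\Bigr)\|\theta^* - \theta_t\|^2 \\
&\qquad + \frac{1}{2\eta_1}\|\theta^* - \theta_1\|^2 - \frac{1}{2\eta_T}\|\theta^* - \theta_{T+1}\|^2,
\end{aligned} \tag{21}
$$

where the inequality (i) is due to the fact that $\theta_1 = 0$. Noting that the third term vanishes for a constant
learning rate and that the last term is non-positive suffices to prove the first part. For the second part
(learning rate $\eta_t = \eta_0/\sqrt{t}$, with $F$ denoting the bound on $\|\theta^* - \theta_t\|$), we continue as:

$$
\begin{aligned}
\sum_{t=1}^{T}\bigl(L(\theta_t; x_t, y_t) + \lambda R(\theta_t)\bigr)
&\le \sum_{t=1}^{T}\bigl(L(\theta^*; x_t, y_t) + \lambda R(\theta^*)\bigr) + \frac{G^2}{2}\sum_{t=1}^{T}\eta_t + \frac{F^2}{2\eta_T} \\
&\overset{(ii)}{\le} \sum_{t=1}^{T}\bigl(L(\theta^*; x_t, y_t) + \lambda R(\theta^*)\bigr) + G^2\eta_0\bigl(\sqrt{T} - 1/2\bigr) + \frac{F^2\sqrt{T}}{2\eta_0} \\
&\le \sum_{t=1}^{T}\bigl(L(\theta^*; x_t, y_t) + \lambda R(\theta^*)\bigr) + \Bigl(G^2\eta_0 + \frac{F^2}{2\eta_0}\Bigr)\sqrt{T},
\end{aligned} \tag{22}
$$

where inequality (ii) is due to the fact that $\sum_{t=1}^{T}\frac{1}{\sqrt{t}} \le 2\sqrt{T} - 1$. For the third part
($L$ is $\sigma$-strongly convex and $\eta_t = 1/(\sigma t)$), continue after inequality (i) as:

$$
\begin{aligned}
\sum_{t=1}^{T}\bigl(L(\theta_t; x_t, y_t) + \lambda R(\theta_t)\bigr)
&\le \sum_{t=1}^{T}\bigl(L(\theta^*; x_t, y_t) + \lambda R(\theta^*)\bigr) + \frac{G^2}{2}\sum_{t=1}^{T}\eta_t + \frac{1}{2}\sum_{t=2}^{T}\Bigl(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} - \sigma\Bigr)\|\theta^* - \theta_t\|^2 \\
&\qquad + \frac{1}{2}\Bigl(\frac{1}{\eta_1} - \sigma\Bigr)\|\theta^* - \theta_1\|^2 - \frac{1}{2\eta_T}\|\theta^* - \theta_{T+1}\|^2 \\
&= \sum_{t=1}^{T}\bigl(L(\theta^*; x_t, y_t) + \lambda R(\theta^*)\bigr) + \frac{G^2}{2\sigma}\sum_{t=1}^{T}\frac{1}{t} - \frac{\sigma T}{2}\|\theta^* - \theta_{T+1}\|^2 \\
&\le \sum_{t=1}^{T}\bigl(L(\theta^*; x_t, y_t) + \lambda R(\theta^*)\bigr) + \frac{G^2}{2\sigma}\sum_{t=1}^{T}\frac{1}{t} \\
&\overset{(iii)}{\le} \sum_{t=1}^{T}\bigl(L(\theta^*; x_t, y_t) + \lambda R(\theta^*)\bigr) + \frac{G^2}{2\sigma}(1 + \log T),
\end{aligned} \tag{23}
$$

where the inequality (iii) is due to the fact that $\sum_{t=1}^{T}\frac{1}{t} \le 1 + \log T$.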
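The two series bounds invoked in steps (ii) and (iii), namely $\sum_{t=1}^{T} 1/\sqrt{t} \le 2\sqrt{T} - 1$ and $\sum_{t=1}^{T} 1/t \le 1 + \log T$, are easy to confirm numerically. The short sketch below does so for a few horizons; the function names and the chosen values of $T$ are illustrative assumptions.

```python
import math

def sum_inverse_sqrt(T):
    """Partial sum of 1/sqrt(t) for t = 1..T."""
    return sum(1.0 / math.sqrt(t) for t in range(1, T + 1))

def harmonic(T):
    """Partial sum of 1/t for t = 1..T."""
    return sum(1.0 / t for t in range(1, T + 1))

horizons = (1, 10, 1000, 100000)
sqrt_bound_holds = all(
    sum_inverse_sqrt(T) <= 2.0 * math.sqrt(T) - 1.0 + 1e-9 for T in horizons)
log_bound_holds = all(
    harmonic(T) <= 1.0 + math.log(T) + 1e-9 for T in horizons)
```

Both bounds are tight at $T = 1$, which is why a small numerical tolerance is used in the comparison.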



E Lipschitz Constants of Some Loss Functions 

Let $\theta^*$ be a solution of the problem (10) with $\Theta = \mathbb{R}^d$. For certain loss functions, we may obtain
bounds of the form $\|\theta^*\| \le \gamma$ for some $\gamma > 0$, as the next proposition illustrates. Therefore, we may
redefine $\Theta = \{\theta \in \mathbb{R}^d \mid \|\theta\| \le \gamma\}$ (a vacuous constraint) without affecting the solution of (10).

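Proposition 8 Let $R(\theta) = \frac{1}{2}\bigl(\sum_{k=1}^{p}\|\theta_k\|\bigr)^2$. Let $L_{\mathrm{SVM}}$ and $L_{\mathrm{CRF}}$ be the structured hinge and logistic
losses (4). Assume that the average cost function (in the SVM case) or the average entropy (in the CRF
case) are bounded by some $\Lambda \ge 0$, i.e.,^4

$$
\frac{1}{m}\sum_{i=1}^{m}\max_{y_i' \in \mathcal{Y}(x_i)}\ell(y_i'; y_i) \le \Lambda \quad\text{or}\quad \frac{1}{m}\sum_{i=1}^{m}H(Y_i) \le \Lambda. \tag{24}
$$

Then:

1. The solution of (10) with $\Theta = \mathbb{R}^d$ satisfies $\|\theta^*\| \le \sqrt{2\Lambda/\lambda}$.

2. $L$ is $G$-Lipschitz on $\mathbb{R}^d$, with $G = 2\max_{u \in \mathcal{U}}\|\phi(u)\|$.

3. Consider the following problem obtained from (10) by adding a quadratic term:

$$
\min_{\theta}\ \frac{\sigma}{2}\|\theta\|^2 + \lambda R(\theta) + \frac{1}{m}\sum_{i=1}^{m}L(\theta; x_i, y_i). \tag{25}
$$

The solution of this problem satisfies $\|\theta^*\| \le \sqrt{2\Lambda/(\lambda + \sigma)}$.

4. The modified loss $\tilde{L} = L + \frac{\sigma}{2}\|\cdot\|^2$ is $\tilde{G}$-Lipschitz on $\{\theta \mid \|\theta\| \le \sqrt{2\Lambda/(\lambda + \sigma)}\}$, where $\tilde{G} = G + \sqrt{2\sigma^2\Lambda/(\lambda + \sigma)}$.

^4 In sequence binary labeling, we have $\Lambda = \bar{N}$ for the CRF case and for the SVM case with a Hamming cost function,
where $\bar{N}$ is the average sequence length. Observe that the entropy of a distribution over labelings of a sequence of length $N$
is upper bounded by $\log 2^N = N$.

Algorithm 4 Moreau projection for the squared weighted $\ell_1$-norm

Input: A vector $\mathbf{x}_0 \in \mathbb{R}^p$, a weight vector $\mathbf{d} > 0$, and a parameter $\lambda > 0$
Set $u_{0r} = |x_{0r}|/d_r$ and $a_r = d_r^2$ for each $r = 1, \ldots, p$
Sort $\mathbf{u}_0$: $u_{0(1)} \ge \ldots \ge u_{0(p)}$
Find $\rho = \max\Bigl\{ j \in \{1, \ldots, p\} \ \Big|\ u_{0(j)} - \frac{\lambda}{1 + \lambda\sum_{r=1}^{j} a_{(r)}}\sum_{r=1}^{j} a_{(r)}u_{0(r)} > 0 \Bigr\}$
Compute $\mathbf{u} = \mathrm{soft}(\mathbf{u}_0, \tau)$, where $\tau = \frac{\lambda}{1 + \lambda\sum_{r=1}^{\rho} a_{(r)}}\sum_{r=1}^{\rho} a_{(r)}u_{0(r)}$
Output: $\mathbf{x}$ s.t. $x_r = \mathrm{sign}(x_{0r})\, d_r u_r$.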

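Proof: Let $F_{\mathrm{SVM}}(\theta)$ and $F_{\mathrm{CRF}}(\theta)$ be the objectives of (10) for the SVM and CRF cases. We have

$$
F_{\mathrm{SVM}}(0) = \lambda R(0) + \frac{1}{m}\sum_{i=1}^{m}L_{\mathrm{SVM}}(0; x_i, y_i) = \frac{1}{m}\sum_{i=1}^{m}\max_{y_i' \in \mathcal{Y}(x_i)}\ell(y_i'; y_i) \le \Lambda_{\mathrm{SVM}} \tag{26}
$$

$$
F_{\mathrm{CRF}}(0) = \lambda R(0) + \frac{1}{m}\sum_{i=1}^{m}L_{\mathrm{CRF}}(0; x_i, y_i) = \frac{1}{m}\sum_{i=1}^{m}\log|\mathcal{Y}(x_i)| \le \Lambda_{\mathrm{CRF}} \tag{27}
$$

Using the facts that $F(\theta^*) \le F(0)$, that the losses are non-negative, and that $\bigl(\sum_k\|\theta_k\|\bigr)^2 \ge \sum_k\|\theta_k\|^2$, we
obtain $\frac{\lambda}{2}\|\theta^*\|^2 \le \lambda R(\theta^*) \le F(\theta^*) \le F(0)$, which proves the first statement.

To prove the second statement for the SVM case, note that a subgradient of $L_{\mathrm{SVM}}$ at $\theta$ is $\mathbf{g}_{\mathrm{SVM}} =
\phi(x, \hat{y}) - \phi(x, y)$, where $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} \theta^\top(\phi(x, y') - \phi(x, y)) + \ell(y'; y)$; and that the
gradient of $L_{\mathrm{CRF}}$ at $\theta$ is $\mathbf{g}_{\mathrm{CRF}} = \mathbb{E}_\theta\,\phi(x, Y) - \phi(x, y)$. Applying Jensen's inequality, we have that
$\|\mathbf{g}_{\mathrm{CRF}}\| \le \mathbb{E}_\theta\|\phi(x, Y) - \phi(x, y)\|$. Therefore, both $\|\mathbf{g}_{\mathrm{SVM}}\|$ and $\|\mathbf{g}_{\mathrm{CRF}}\|$ are upper bounded by
$\max_{x \in \mathcal{X},\, y, y' \in \mathcal{Y}(x)}\|\phi(x, y') - \phi(x, y)\| \le 2\max_{u \in \mathcal{U}}\|\phi(u)\|$.

The same rationale can be used to prove the third and fourth statements. ■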



F Computing the proximity operator of the (non-separable) squared $\ell_1$

We present an algorithm (Alg. 4) that computes the Moreau projection of the squared, weighted $\ell_1$-norm.
Denote by $\odot$ the Hadamard product, $[\mathbf{a} \odot \mathbf{b}]_k = a_k b_k$. Letting $\lambda, \mathbf{d} \ge 0$, and $\phi_{\mathbf{d}}(\mathbf{x}) =
\frac{1}{2}\|\mathbf{d} \odot \mathbf{x}\|_1^2$, the underlying optimization problem is:

$$
M_{\lambda\phi_{\mathbf{d}}}(\mathbf{x}_0) = \min_{\mathbf{x} \in \mathbb{R}^p}\ \frac{1}{2}\|\mathbf{x} - \mathbf{x}_0\|^2 + \frac{\lambda}{2}\Bigl(\sum_{i=1}^{p} d_i|x_i|\Bigr)^2. \tag{28}
$$

This includes the squared $\ell_1$-norm as a particular case, when $\mathbf{d} = \mathbf{1}$ (the case addressed in Alg. 2).
The proof is somewhat technical and follows the same procedure employed by Duchi et al. (2008) to
derive an algorithm for projecting onto the $\ell_1$-ball. The runtime is $O(p \log p)$ (the amount of time that
is necessary to sort the vector), but a similar trick as the one described in (Duchi et al., 2008) can be
employed to yield $O(p)$ runtime.
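The steps of Alg. 4 can be sketched as follows. This is a minimal, sort-based $O(p \log p)$ sketch rather than a reference implementation: it assumes strictly positive weights $d_r$ (so that the division is defined), and the names `prox_squared_weighted_l1` and `objective` are hypothetical. The helper `objective` evaluates the function minimized in (28) so that the output can be compared against nearby points.

```python
import math

def objective(x, x0, d, lam):
    """Value of 0.5*||x - x0||^2 + 0.5*lam*(sum_i d_i*|x_i|)^2, as in (28)."""
    quad = 0.5 * sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0))
    weighted_l1 = sum(di * abs(xi) for di, xi in zip(d, x))
    return quad + 0.5 * lam * weighted_l1 ** 2

def prox_squared_weighted_l1(x0, d, lam):
    """Sort-based sketch of Alg. 4; assumes every d_r > 0 and lam > 0."""
    p = len(x0)
    u0 = [abs(x) / w for x, w in zip(x0, d)]   # u_{0r} = |x_{0r}| / d_r
    a = [w * w for w in d]                      # a_r = d_r^2
    order = sorted(range(p), key=lambda r: u0[r], reverse=True)
    tau = 0.0
    sum_a = 0.0    # running sum of a_(r)
    sum_au = 0.0   # running sum of a_(r) * u_{0(r)}
    for r in order:
        sum_a += a[r]
        sum_au += a[r] * u0[r]
        candidate = lam * sum_au / (1.0 + lam * sum_a)
        if u0[r] - candidate > 0.0:
            tau = candidate  # threshold for the largest index passing the test so far
    u = [max(v - tau, 0.0) for v in u0]         # soft-thresholding of u0
    # Undo the change of variables and restore the signs of x0.
    return [math.copysign(w * v, x) for x, w, v in zip(x0, d, u)]

x_unweighted = prox_squared_weighted_l1([-3.0, 1.0], [1.0, 1.0], 1.0)
x_weighted = prox_squared_weighted_l1([4.0, 3.0], [2.0, 1.0], 0.5)
```

In the first call the smaller coordinate is thresholded to zero and the larger one is halved; in the second call both coordinates stay active and the stationarity conditions of (28) hold at the returned point.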

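Lemma 9 Let $\mathbf{x}^* = \mathrm{prox}_{\lambda\phi_{\mathbf{d}}}(\mathbf{x}_0)$ be the solution of (28). Then:

1. $\mathbf{x}^*$ agrees in sign with $\mathbf{x}_0$, i.e., each component satisfies $x_{0i} \cdot x_i^* \ge 0$.

2. Let $\boldsymbol{\sigma} \in \{-1, 1\}^p$. Then $\mathrm{prox}_{\lambda\phi_{\mathbf{d}}}(\boldsymbol{\sigma} \odot \mathbf{x}_0) = \boldsymbol{\sigma} \odot \mathrm{prox}_{\lambda\phi_{\mathbf{d}}}(\mathbf{x}_0)$, i.e., flipping a sign in $\mathbf{x}_0$
produces an $\mathbf{x}^*$ with the same sign flipped.

Proof: Suppose that $x_{0i} \cdot x_i^* < 0$ for some $i$. Then, $\tilde{\mathbf{x}}$ defined by $\tilde{x}_j = x_j^*$ for $j \ne i$ and $\tilde{x}_i = -x_i^*$
achieves a lower objective value than $\mathbf{x}^*$, since $\phi_{\mathbf{d}}(\tilde{\mathbf{x}}) = \phi_{\mathbf{d}}(\mathbf{x}^*)$ and $(\tilde{x}_i - x_{0i})^2 < (x_i^* - x_{0i})^2$; this
contradicts the optimality of $\mathbf{x}^*$. The second statement is a simple consequence of the first one and
that $\phi_{\mathbf{d}}(\boldsymbol{\sigma} \odot \mathbf{x}) = \phi_{\mathbf{d}}(\mathbf{x})$. ■

Lemma 9 enables reducing the problem to the non-negative orthant, by writing $\mathbf{x}_0 = \boldsymbol{\sigma} \odot \tilde{\mathbf{x}}_0$, with
$\tilde{\mathbf{x}}_0 \ge 0$, obtaining a solution $\tilde{\mathbf{x}}^*$ and then recovering the true solution as $\mathbf{x}^* = \boldsymbol{\sigma} \odot \tilde{\mathbf{x}}^*$. It therefore
suffices to solve (28) with the constraint $\mathbf{x} \ge 0$, which in turn can be transformed into:

$$
\min_{\mathbf{u} \ge 0}\ F(\mathbf{u}) \triangleq \frac{1}{2}\sum_{r=1}^{p} a_r(u_r - u_{0r})^2 + \frac{\lambda}{2}\Bigl(\sum_{r=1}^{p} a_r u_r\Bigr)^2, \tag{29}
$$

where we made the change of variables $a_i = d_i^2$, $u_{0i} = x_{0i}/d_i$ and $u_i = x_i/d_i$.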

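The Lagrangian of (29) is $\mathcal{L}(\mathbf{u}, \boldsymbol{\xi}) = \frac{1}{2}\sum_{r=1}^{p} a_r(u_r - u_{0r})^2 + \frac{\lambda}{2}\bigl(\sum_{r=1}^{p} a_r u_r\bigr)^2 - \boldsymbol{\xi}^\top\mathbf{u}$, where
$\boldsymbol{\xi} \ge 0$ are Lagrange multipliers. Equating the gradient (w.r.t. $\mathbf{u}$) to zero gives

$$
\mathbf{a} \odot (\mathbf{u} - \mathbf{u}_0) + \lambda\Bigl(\sum_{r=1}^{p} a_r u_r\Bigr)\mathbf{a} - \boldsymbol{\xi} = 0. \tag{30}
$$

From the complementary slackness condition, $u_j > 0$ implies $\xi_j = 0$, which in turn implies

$$
a_j(u_j - u_{0j}) + \lambda a_j\sum_{r=1}^{p} a_r u_r = 0. \tag{31}
$$

Thus, if $u_j > 0$, the solution is of the form $u_j = u_{0j} - \tau$, with $\tau = \lambda\sum_{r=1}^{p} a_r u_r$. The next lemma
shows the existence of a split point below which some coordinates vanish.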

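Lemma 10 Let $\mathbf{u}^*$ be the solution of (29). If $u_k^* = 0$ and $u_{0j} < u_{0k}$, then we must have $u_j^* = 0$.

Proof: Suppose that $u_j^* = \epsilon > 0$. We will construct a $\tilde{\mathbf{u}}$ whose objective value is lower than $F(\mathbf{u}^*)$,
which contradicts the optimality of $\mathbf{u}^*$: set $\tilde{u}_l = u_l^*$ for $l \notin \{j, k\}$, $\tilde{u}_k = \epsilon c$, and $\tilde{u}_j = \epsilon(1 - c\,a_k/a_j)$,
where $c = \min\{a_j/a_k, 1\}$. We have $\sum_{r=1}^{p} a_r u_r^* = \sum_{r=1}^{p} a_r\tilde{u}_r$, and therefore

$$
\begin{aligned}
2\bigl(F(\tilde{\mathbf{u}}) - F(\mathbf{u}^*)\bigr) &= \sum_{r=1}^{p} a_r(\tilde{u}_r - u_{0r})^2 - \sum_{r=1}^{p} a_r(u_r^* - u_{0r})^2 \\
&= a_j(\tilde{u}_j - u_{0j})^2 - a_j(u_j^* - u_{0j})^2 + a_k(\tilde{u}_k - u_{0k})^2 - a_k(u_k^* - u_{0k})^2.
\end{aligned} \tag{32}
$$

Consider the following two cases: (i) if $a_j \le a_k$, then $\tilde{u}_k = \epsilon a_j/a_k$ and $\tilde{u}_j = 0$. Substituting
in (32), we obtain $2(F(\tilde{\mathbf{u}}) - F(\mathbf{u}^*)) = \epsilon^2(a_j^2/a_k - a_j) + 2a_j\epsilon(u_{0j} - u_{0k}) < 0$, since the first term is
non-positive and the second is negative; this leads to the contradiction $F(\tilde{\mathbf{u}}) < F(\mathbf{u}^*)$.
If (ii) $a_j > a_k$, then $\tilde{u}_k = \epsilon$ and $\tilde{u}_j = \epsilon(1 - a_k/a_j)$. Substituting in (32), we obtain
$2(F(\tilde{\mathbf{u}}) - F(\mathbf{u}^*)) = a_j\epsilon^2(1 - a_k/a_j)^2 + 2a_k\epsilon u_{0j} - 2a_k\epsilon u_{0k} + a_k\epsilon^2 - a_j\epsilon^2 < (a_k^2/a_j)\epsilon^2 - 2a_k\epsilon^2 + a_k\epsilon^2 =
\epsilon^2(a_k^2/a_j - a_k) < 0$, which also leads to a contradiction. ■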
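Let $u_{0(1)} \ge \ldots \ge u_{0(p)}$ be the entries of $\mathbf{u}_0$ sorted in decreasing order, and let $u_{(1)}^*, \ldots, u_{(p)}^*$ be
the entries of $\mathbf{u}^*$ under the same permutation. Let $\rho$ be the number of nonzero entries in $\mathbf{u}^*$, i.e.,
$u_{(\rho)}^* > 0$, and, if $\rho < p$, $u_{(\rho+1)}^* = 0$. Summing (31) for $(j) = 1, \ldots, \rho$, we get

$$
\sum_{r=1}^{\rho} a_{(r)}u_{(r)}^* - \sum_{r=1}^{\rho} a_{(r)}u_{0(r)} + \Bigl(\sum_{r=1}^{\rho} a_{(r)}\Bigr)\lambda\sum_{r=1}^{\rho} a_{(r)}u_{(r)}^* = 0, \tag{33}
$$

which implies

$$
\sum_{r=1}^{p} a_r u_r^* = \sum_{r=1}^{\rho} a_{(r)}u_{(r)}^* = \frac{1}{1 + \lambda\sum_{r=1}^{\rho} a_{(r)}}\sum_{r=1}^{\rho} a_{(r)}u_{0(r)}, \tag{34}
$$

and therefore $\tau = \frac{\lambda}{1 + \lambda\sum_{r=1}^{\rho} a_{(r)}}\sum_{r=1}^{\rho} a_{(r)}u_{0(r)}$. The complementary slackness conditions for $r = \rho$
and $r = \rho + 1$ imply

$$
u_{(\rho)}^* - u_{0(\rho)} + \lambda\sum_{r=1}^{\rho} a_{(r)}u_{(r)}^* = 0 \quad\text{and}\quad -u_{0(\rho+1)} + \lambda\sum_{r=1}^{\rho} a_{(r)}u_{(r)}^* = \frac{\xi_{(\rho+1)}}{a_{(\rho+1)}} \ge 0; \tag{35}
$$

therefore $u_{0(\rho)} > u_{0(\rho)} - u_{(\rho)}^* = \tau \ge u_{0(\rho+1)}$. This implies that $\rho$ is such that

$$
u_{0(\rho)} > \frac{\lambda}{1 + \lambda\sum_{r=1}^{\rho} a_{(r)}}\sum_{r=1}^{\rho} a_{(r)}u_{0(r)} \ge u_{0(\rho+1)}. \tag{36}
$$

The next proposition goes farther by exactly determining $\rho$.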
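Proposition 11 The quantity $\rho$ can be determined via:

$$
\rho = \max\Bigl\{ j \in [p] \ \Big|\ u_{0(j)} - \frac{\lambda}{1 + \lambda\sum_{r=1}^{j} a_{(r)}}\sum_{r=1}^{j} a_{(r)}u_{0(r)} > 0 \Bigr\}. \tag{37}
$$

Proof: Let $\rho^* = \max\{j \mid u_{(j)}^* > 0\}$. We have that $u_{(r)}^* = u_{0(r)} - \tau^*$ for $r \le \rho^*$, where $\tau^* =
\frac{\lambda}{1 + \lambda\sum_{r=1}^{\rho^*} a_{(r)}}\sum_{r=1}^{\rho^*} a_{(r)}u_{0(r)}$, and therefore $\rho \ge \rho^*$. We need to prove that $\rho \le \rho^*$, which we will do
by contradiction. Assume that $\rho > \rho^*$. Let $\bar{\mathbf{u}}$ be the vector induced by the choice of $\rho$, i.e., $\bar{u}_{(r)} = 0$ for
$r > \rho$ and $\bar{u}_{(r)} = u_{0(r)} - \tau$ for $r \le \rho$, where $\tau = \frac{\lambda}{1 + \lambda\sum_{r=1}^{\rho} a_{(r)}}\sum_{r=1}^{\rho} a_{(r)}u_{0(r)}$. From the definition
of $\rho$, we have $\bar{u}_{(\rho)} = u_{0(\rho)} - \tau > 0$, which implies $\bar{u}_{(r)} = u_{0(r)} - \tau > 0$ for each $r \le \rho$. In addition,

$$
\begin{aligned}
\sum_{r=1}^{p} a_r\bar{u}_r &= \sum_{r=1}^{\rho} a_{(r)}u_{0(r)} - \sum_{r=1}^{\rho} a_{(r)}\tau
= \Bigl(1 - \frac{\lambda\sum_{r=1}^{\rho} a_{(r)}}{1 + \lambda\sum_{r=1}^{\rho} a_{(r)}}\Bigr)\sum_{r=1}^{\rho} a_{(r)}u_{0(r)} \\
&= \frac{1}{1 + \lambda\sum_{r=1}^{\rho} a_{(r)}}\sum_{r=1}^{\rho} a_{(r)}u_{0(r)} = \frac{\tau}{\lambda},
\end{aligned} \tag{38}
$$

$$
\begin{aligned}
\sum_{r=1}^{p} a_r(\bar{u}_r - u_{0r})^2 &= \sum_{r=1}^{\rho^*} a_{(r)}\tau^2 + \sum_{r=\rho^*+1}^{\rho} a_{(r)}\tau^2 + \sum_{r=\rho+1}^{p} a_{(r)}u_{0(r)}^2 \\
&< \sum_{r=1}^{\rho^*} a_{(r)}\tau^2 + \sum_{r=\rho^*+1}^{p} a_{(r)}u_{0(r)}^2.
\end{aligned} \tag{39}
$$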
r=l r=p*+l 
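We next consider two cases:

1. $\tau^* \ge \tau$. From (39), we have that $\sum_{r=1}^{p} a_r(\bar{u}_r - u_{0r})^2 < \sum_{r=1}^{\rho^*} a_{(r)}\tau^2 + \sum_{r=\rho^*+1}^{p} a_{(r)}u_{0(r)}^2 \le
\sum_{r=1}^{\rho^*} a_{(r)}(\tau^*)^2 + \sum_{r=\rho^*+1}^{p} a_{(r)}u_{0(r)}^2 = \sum_{r=1}^{p} a_r(u_r^* - u_{0r})^2$. From (38), we have that $\bigl(\sum_{r=1}^{p} a_r\bar{u}_r\bigr)^2 =
\tau^2/\lambda^2 \le (\tau^*)^2/\lambda^2$. Summing the two inequalities, we get $F(\bar{\mathbf{u}}) < F(\mathbf{u}^*)$, which leads to a contradiction.

2. $\tau^* < \tau$. We will construct a vector $\tilde{\mathbf{u}}$ from $\mathbf{u}^*$ and show that $F(\tilde{\mathbf{u}}) < F(\mathbf{u}^*)$. Define

$$
\tilde{u}_{(r)} =
\begin{cases}
u_{(\rho^*)}^* - \dfrac{2a_{(\rho^*+1)}}{a_{(\rho^*)} + a_{(\rho^*+1)}}\,\epsilon, & \text{if } r = \rho^* \\[2mm]
\dfrac{2a_{(\rho^*)}}{a_{(\rho^*)} + a_{(\rho^*+1)}}\,\epsilon, & \text{if } r = \rho^* + 1 \\[2mm]
u_{(r)}^*, & \text{otherwise,}
\end{cases} \tag{40}
$$

where $\epsilon = (u_{0(\rho^*+1)} - \tau^*)/2$. Note that $\sum_{r=1}^{p} a_r\tilde{u}_r = \sum_{r=1}^{p} a_r u_r^*$. From the assumptions that
$\tau^* < \tau$ and $\rho^* < \rho$, we have that $\bar{u}_{(\rho^*+1)} = u_{0(\rho^*+1)} - \tau > 0$, which implies that

$$
\tilde{u}_{(\rho^*+1)} = \frac{a_{(\rho^*)}(u_{0(\rho^*+1)} - \tau^*)}{a_{(\rho^*)} + a_{(\rho^*+1)}} > \frac{a_{(\rho^*)}(u_{0(\rho^*+1)} - \tau)}{a_{(\rho^*)} + a_{(\rho^*+1)}} = \frac{a_{(\rho^*)}\bar{u}_{(\rho^*+1)}}{a_{(\rho^*)} + a_{(\rho^*+1)}} > 0,
$$

and that

$$
\begin{aligned}
\tilde{u}_{(\rho^*)} &= u_{0(\rho^*)} - \tau^* - \frac{a_{(\rho^*+1)}(u_{0(\rho^*+1)} - \tau^*)}{a_{(\rho^*)} + a_{(\rho^*+1)}}
= u_{0(\rho^*)} - \frac{a_{(\rho^*+1)}u_{0(\rho^*+1)}}{a_{(\rho^*)} + a_{(\rho^*+1)}} - \Bigl(1 - \frac{a_{(\rho^*+1)}}{a_{(\rho^*)} + a_{(\rho^*+1)}}\Bigr)\tau^* \\
&\overset{(i)}{>} \Bigl(1 - \frac{a_{(\rho^*+1)}}{a_{(\rho^*)} + a_{(\rho^*+1)}}\Bigr)(u_{0(\rho^*+1)} - \tau) = \Bigl(1 - \frac{a_{(\rho^*+1)}}{a_{(\rho^*)} + a_{(\rho^*+1)}}\Bigr)\bar{u}_{(\rho^*+1)} > 0,
\end{aligned}
$$

where inequality (i) is justified by the facts that $u_{0(\rho^*)} \ge u_{0(\rho^*+1)}$ and $\tau > \tau^*$. This ensures that $\tilde{\mathbf{u}}$
is well defined. We have:

$$
\begin{aligned}
2\bigl(F(\mathbf{u}^*) - F(\tilde{\mathbf{u}})\bigr) &= \sum_{r=1}^{p} a_r(u_r^* - u_{0r})^2 - \sum_{r=1}^{p} a_r(\tilde{u}_r - u_{0r})^2 \\
&= a_{(\rho^*)}(\tau^*)^2 + a_{(\rho^*+1)}u_{0(\rho^*+1)}^2 - a_{(\rho^*)}\Bigl(\tau^* + \frac{2a_{(\rho^*+1)}\epsilon}{a_{(\rho^*)} + a_{(\rho^*+1)}}\Bigr)^2 - a_{(\rho^*+1)}\Bigl(u_{0(\rho^*+1)} - \frac{2a_{(\rho^*)}\epsilon}{a_{(\rho^*)} + a_{(\rho^*+1)}}\Bigr)^2 \\
&= -\frac{4a_{(\rho^*)}a_{(\rho^*+1)}\epsilon}{a_{(\rho^*)} + a_{(\rho^*+1)}}\bigl(\tau^* - u_{0(\rho^*+1)}\bigr) - \frac{4a_{(\rho^*)}a_{(\rho^*+1)}^2\epsilon^2}{(a_{(\rho^*)} + a_{(\rho^*+1)})^2} - \frac{4a_{(\rho^*)}^2a_{(\rho^*+1)}\epsilon^2}{(a_{(\rho^*)} + a_{(\rho^*+1)})^2} \\
&= \frac{a_{(\rho^*)}a_{(\rho^*+1)}\bigl(u_{0(\rho^*+1)} - \tau^*\bigr)^2}{a_{(\rho^*)} + a_{(\rho^*+1)}} > 0,
\end{aligned} \tag{41}
$$

which leads to a contradiction and completes the proof. ■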